Abstract

We consider the sampling problem for functional PCA (fPCA), where the simplest example is
the case of taking time samples of the underlying functional components. More generally, we
model the sampling operation as a continuous linear map from $\mathcal{H}$ to $\mathbb{R}^m$,
where the functional components lie in some Hilbert subspace $\mathcal{H}$ of $L^2$, for
example an RKHS of smooth functions. This model includes time and frequency sampling as
special cases. In contrast to the classical approach in fPCA, in which access to entire
functions is assumed, having a limited number $m$ of functional samples places limitations on
the performance of statistical procedures. We study these effects by analyzing the rate of
convergence of an M-estimator for the subspace spanned by the leading components in a
multi-spiked covariance model. The estimator takes the form of regularized PCA, and hence is
computationally attractive. We analyze the behavior of this estimator within a non-asymptotic
framework, and provide bounds that hold with high probability as a function of the number of
statistical samples $n$ and the number of functional samples $m$. We also derive lower bounds
showing that the rates obtained are minimax optimal.

1 Introduction 

The statistical analysis of functional data, commonly known as functional data analysis (FDA),
is an established area of statistics with a great number of practical applications (e.g., see the
books [27, 28] and references therein for various examples). When the data is available as finely
sampled curves, say in time, it is common to treat it as a collection of continuous-time curves
or functions, each being observed in totality. These datasets are then termed "functional", and
various statistical procedures applicable in finite dimensions can be extended to this functional
setting. Among such procedures is principal component analysis (PCA), which is the focus of
the present work.

If one thinks of continuity as a mathematical abstraction of reality, then treating functional 
data as continuous curves is arguably a valid modeling device. However, in practice, one is faced 
with finite computational resources and is forced to implement a (finite-dimensional) approx- 
imation of true functional procedures by some sort of truncation of functions, for instance in 
the frequency domain. It is then important to understand the effects of this truncation on the 
statistical performance of the procedure. In other situations, for example in longitudinal data
analysis [12], a continuous curve model is justified as a hidden underlying generating process to
which one has access only through sparsely sampled measurements in time, possibly corrupted
by noise. Studying how the time-sampling affects the estimation of the underlying functions in
the presence of noise has some elements in common with the frequency-domain problem
mentioned above.

The aim of this paper is to study the effects of "sampling" (in a fairly general sense) on
functional principal component analysis in smooth function spaces. We take a functional-theoretic
approach to sampling by treating the sampling procedure as a (continuous) linear operator. This
provides us with a notion of sampling general enough to treat both frequency truncation and
time sampling within a unified framework. We take as our smooth function space a Hilbert
subspace $\mathcal{H}$ of $L^2[0,1]$ and denote the sampling operator by
$\Phi : \mathcal{H} \to \mathbb{R}^m$. We assume that there are functions $x_i(t)$, $t \in [0,1]$,
in $\mathcal{H}$ for $i = 1, \ldots, n$, generated i.i.d. from a probabilistic model (to be
discussed). We then observe the collection $\{\Phi x_i\}_{i=1}^n \subset \mathbb{R}^m$ in noise.
We refer to the index $n$ as the number of statistical samples, and to the index $m$ as the
number of functional samples.

We analyze a natural M-estimator which takes the form of a regularized PCA in $\mathbb{R}^m$, and
provide rates of convergence in terms of $n$ and $m$. The eigen-decay of two operators governs
the rates: the product of $\Phi$ and its adjoint, and the product of the map embedding
$\mathcal{H}$ in $L^2$ and its adjoint. Our focus will be on the setting where $\mathcal{H}$ is a
reproducing kernel Hilbert space (RKHS), in which case the two eigen-decays are intimately
related through the kernel function $(s,t) \mapsto \mathbb{K}(s,t)$. In such cases, the two
components of the rate interact and give rise to optimal values for the number of functional
samples ($m$) in terms of the number of statistical samples ($n$), or vice versa. This has
practical appeal in cases where obtaining either type of sample is costly.

Our model for the functions $\{x_i\}$ will be an extension to function spaces of the spiked
covariance model introduced by Johnstone and his collaborators [17, 18], and studied by various
authors (e.g., [18, 22, 1]). We consider such models with $r$ components, each lying within the
Hilbert ball $B_{\mathcal{H}}(\rho)$ of radius $\rho$, with the goal of recovering the
$r$-dimensional subspace spanned by the spiked components in this functional model. We analyze
our M-estimators within a high-dimensional framework that allows both the number of statistical
samples $n$ and the number of functional samples $m$ to diverge together. Our main theoretical
contributions are to derive non-asymptotic bounds on the estimation error as a function of the
pair $(m, n)$, which are shown to be sharp (minimax-optimal). Although our rates also explicitly
track the number of components $r$ and the smoothness parameter $\rho$, we do not make any effort
to obtain optimal dependence on these parameters.

The general asymptotic properties of PCA in function spaces have been investigated by 
various authors (e.g., [TUl [15] .) Accounting for smoothness of functions by introducing various 
roughness/smoothness penalties is a standard approach, used in the papers [291 [23 [3Q1 E] among 
others. The problem of principal component analysis for sampled functions, with a similar 
functional-theoretic perspective, is discussed by Besse and Ramsey [I] for the noiseless case. A 
more recent line of work is devoted to the case of functional PCA with noisy sampled functions [3 
SHE]. Cardot [8] considers estimation via spline-based approximation, and derives MISE rates 
in terms of various parameters of the model. Hall et al. [16] study estimation via local linear
smoothing, and establish minimax-optimality in certain settings that involve a fixed number of
functional samples. Both papers [8, 16] demonstrate trade-offs between the numbers of statistical
and functional samples; we refer the reader to Hall et al. [16] for an illuminating discussion of
connections between FDA and LDA approaches (i.e., having full versus sampled functions), which
inspired much of the present work. We note that the regularization present in our M-estimator
is closely related to classical roughness penalties [29, 30] in the special case of spline kernels,
although the discussion there applies to fully-observed functions, as opposed to the sampled
models considered here.

As mentioned above, our sampled model closely resembles the spiked covariance model
for high-dimensional principal component analysis. A line of work on this model has treated
various types of sparsity conditions on the eigenfunctions [18, 22, 1]; in contrast, here the
smoothness condition on functional components translates into an ellipsoid condition on the
vector principal components. Perhaps an even more significant difference is that, here, the
effective scaling of noise in $\mathbb{R}^m$ is substantially smaller in some cases (e.g., the case
of time sampling). This could explain why the difficulty of the "high-dimensional" setting is not
observed in such cases as one lets $m, n \to \infty$. On the other hand, a difficulty particular to
our sampled model is the lack of orthonormality between components (after sampling), which leads
to identifiability issues; it also makes recovering individual components difficult. In order to
derive non-asymptotic bounds on our M-estimator, we exploit various techniques from empirical
process theory (e.g., [31]), as well
as the concentration of measure (e.g., [SO])- We also exploit recent work [23] on the localized 
Rademacher complexities of unit balls in a reproducing kernel Hilbert space, as well as techniques
from non-asymptotic random matrix theory, as discussed in Davidson and Szarek [11], in order
to control various norms of random matrices. These techniques allow us to obtain finite-sample
bounds that hold with high probability, and are specified explicitly in terms of the pair $(m, n)$
and the underlying smoothness of the Hilbert space.

The remainder of this paper is organized as follows. Section 2 is devoted to background
material on reproducing kernel Hilbert spaces, adjoints of operators, as well as the class of
sampled functional models that we study in this paper. In Section 3, we describe M-estimators
for sampled functional PCA, and discuss various implementation details. Section 4 is devoted to
the statements of our main results, and discussion of their consequences for particular sampling
models. In subsequent sections, we provide the proofs of our results, with some more technical
aspects deferred to the appendices. Section 5 is devoted to bounds on the subspace-based error,
whereas Section 6 is devoted to bounds on error in the function space. Section 7 provides
matching lower bounds on the minimax error, showing that our analysis is sharp. We conclude
with a discussion in Section 8.

Notation. We will use $|\!|\!|\cdot|\!|\!|_{HS}$ to denote the Hilbert-Schmidt norm of an operator or a
matrix. The corresponding inner product is denoted as $\langle\langle \cdot, \cdot \rangle\rangle$. If $T$ is an
operator on a Hilbert space $H$ with an orthonormal basis $\{e_j\}$, then
$|\!|\!|T|\!|\!|_{HS}^2 = \sum_j \|T e_j\|_H^2$. For a matrix $A = (a_{ij})$, we have
$|\!|\!|A|\!|\!|_{HS}^2 = \sum_{i,j} |a_{ij}|^2$.

2 Background and problem set-up 

In this section, we begin by introducing background on reproducing kernel Hilbert spaces, as well 
as linear operators and their adjoints. We then introduce the functional and observation model 
that we study in this paper, and conclude with discussion of some approximation-theoretic issues 
that play an important role in parts of our analysis. 

2.1 Reproducing Kernel Hilbert Spaces 

We begin with a quick overview of some standard properties of reproducing kernel Hilbert spaces;
we refer the reader to the books [32, 14] for more details. A reproducing kernel Hilbert space
(or RKHS for short) is a Hilbert space $\mathcal{H}$ of functions $f : T \to \mathbb{R}$ that is equipped
with an associated kernel $\mathbb{K} : T \times T \to \mathbb{R}$. We assume the kernel to be continuous,
and the set $T \subset \mathbb{R}^d$ to be compact. For concreteness, we think of $T = [0, 1]$ throughout
this paper, but any compact subset of $\mathbb{R}^d$ suffices. For each $t \in T$, the function
$R_t := \mathbb{K}(\cdot, t)$ belongs to the Hilbert space $\mathcal{H}$, and it acts as the representer of
evaluation, meaning that $\langle f, R_t \rangle_{\mathcal{H}} = f(t)$ for all $f \in \mathcal{H}$.

The kernel $\mathbb{K}$ defines an integral operator $T_{\mathbb{K}}$ on $L^2(T)$, mapping the function $f$
to the function $g(s) = \int_T \mathbb{K}(s,t) f(t)\, dt$. By the spectral theorem in Hilbert spaces, this
operator can be associated with a sequence of eigenfunctions $\psi_k$, $k = 1, 2, \ldots$ in
$\mathcal{H}$, orthogonal in $\mathcal{H}$ and orthonormal in $L^2(T)$, and a sequence of non-negative
eigenvalues $\mu_1 \ge \mu_2 \ge \cdots$. Most useful for this paper is the fact that any function
$f \in \mathcal{H}$ has an expansion in terms of these eigenfunctions and eigenvalues, namely

$$f = \sum_{k=1}^{\infty} \sqrt{\mu_k}\, \alpha_k\, \psi_k \tag{1}$$

for some $(\alpha_k) \in \ell^2$. In terms of this expansion, we have the representations
$\|f\|_{\mathcal{H}}^2 = \sum_{k=1}^{\infty} \alpha_k^2$ and $\|f\|_{L^2}^2 = \sum_{k=1}^{\infty} \mu_k \alpha_k^2$.
Many of our results involve the decay rate of these eigenvalues: in particular, for some
parameter $\alpha > 1/2$, we say that the kernel operator has eigenvalues with polynomial-$\alpha$
decay if there is a constant $c > 0$ such that

$$\mu_k \le \frac{c}{k^{2\alpha}} \quad \text{for all } k = 1, 2, \ldots. \tag{2}$$

Let us consider an example to illustrate. 

Example 1 (Sobolev class with smoothness $\alpha = 1$). In the case $T = [0,1]$ and $\alpha = 1$, we
can consider the kernel function $\mathbb{K}(s,t) = \min\{s,t\}$. As discussed in Appendix A, this kernel
generates the class of functions

$$\mathcal{H} := \{ f \in L^2([0,1]) \mid f(0) = 0,\ f \text{ absolutely continuous and } f' \in L^2([0,1]) \}.$$

The class $\mathcal{H}$ is an RKHS with inner product $\langle f, g \rangle_{\mathcal{H}} = \int_0^1 f'(t) g'(t)\, dt$,
and the ball $B_{\mathcal{H}}(\rho)$ corresponds to a Sobolev space with smoothness $\alpha = 1$. The
eigen-decomposition of the kernel integral operator is

$$\mu_k = \Big[ \frac{(2k-1)\pi}{2} \Big]^{-2}, \qquad \psi_k(t) = \sqrt{2}\, \sin\big(\mu_k^{-1/2}\, t\big), \qquad k = 1, 2, \ldots. \tag{3}$$

Consequently, this class has polynomial decay with parameter $\alpha = 1$.

We note that there are natural generalizations of this example to $\alpha = 2, 3, \ldots$, corresponding to
the Sobolev classes of $\alpha$-times differentiable functions (e.g., see the books [14, 13]).
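
The eigenvalue formula in equation (3) can be checked numerically. The following is a minimal sketch, not part of the original analysis: it discretizes the integral operator with kernel $\min\{s,t\}$ on a midpoint grid (the grid size `N` is an arbitrary illustrative choice) and compares the leading eigenvalues of the resulting matrix with the closed form.

```python
import numpy as np

# Midpoint discretization of the integral operator with kernel min(s, t)
# on [0, 1]; N is an illustrative grid size, not a quantity from the paper.
N = 500
grid = (np.arange(N) + 0.5) / N
gram = np.minimum.outer(grid, grid) / N            # quadrature weight 1/N

# Leading eigenvalues of the discretized operator, in decreasing order.
numeric = np.sort(np.linalg.eigvalsh(gram))[::-1][:5]

# Closed form from equation (3): mu_k = [(2k - 1) pi / 2]^(-2).
k = np.arange(1, 6)
closed_form = ((2 * k - 1) * np.pi / 2) ** (-2.0)

for kk, a, b in zip(k, numeric, closed_form):
    print(f"k={kk}: numeric={a:.6f}  closed form={b:.6f}")
```

The two columns should agree closely, and the agreement should improve as `N` grows.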

In this paper, the operation of generalized sampling is defined in terms of a bounded linear
operator $\Phi : \mathcal{H} \to \mathbb{R}^m$ on the Hilbert space. Its adjoint is a mapping
$\Phi^* : \mathbb{R}^m \to \mathcal{H}$, defined by the relation
$\langle \Phi f, a \rangle_{\mathbb{R}^m} = \langle f, \Phi^* a \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$ and
$a \in \mathbb{R}^m$. In order to compute a representation of the adjoint, we note that by the Riesz
representation theorem, the $j$-th coordinate of this mapping (namely, $f \mapsto [\Phi f]_j$) can be
represented as an inner product $\langle \phi_j, f \rangle_{\mathcal{H}}$, for some element
$\phi_j \in \mathcal{H}$, and we can write

$$\Phi f = \big[ \langle \phi_1, f \rangle_{\mathcal{H}} \ \ \langle \phi_2, f \rangle_{\mathcal{H}} \ \ \cdots \ \ \langle \phi_m, f \rangle_{\mathcal{H}} \big]^T. \tag{4}$$

Consequently, we have
$\langle \Phi(f), a \rangle_{\mathbb{R}^m} = \sum_{j=1}^m a_j \langle \phi_j, f \rangle_{\mathcal{H}} = \langle \sum_{j=1}^m a_j \phi_j, f \rangle_{\mathcal{H}}$,
so that for any $a \in \mathbb{R}^m$, the adjoint can be written as

$$\Phi^* a = \sum_{j=1}^m a_j \phi_j. \tag{5}$$

This adjoint operator plays an important role in our analysis.

2.2 Functional model and observations

Let $s_1 \ge s_2 \ge s_3 \ge \cdots \ge s_r > 0$ be a fixed sequence of positive numbers, and let
$\{f_j^*\}_{j=1}^r$ be a fixed sequence of functions orthonormal in $L^2[0,1]$. Consider a collection of
$n$ i.i.d. random functions $\{x_1, \ldots, x_n\}$, generated according to the model

$$x_i(t) = \sum_{j=1}^r s_j\, \beta_{ij}\, f_j^*(t), \quad \text{for } i = 1, \ldots, n, \tag{6}$$

where $\{\beta_{ij}\}$ are i.i.d. $N(0,1)$ across all pairs $(i,j)$. This model corresponds to a finite-rank
instantiation of functional PCA, in which the goal is to estimate the span of the unknown
eigenfunctions $\{f_j^*\}_{j=1}^r$. Typically, these eigenfunctions are assumed to satisfy certain
smoothness conditions; in this paper, we model such conditions by assuming that the
eigenfunctions belong to a reproducing kernel Hilbert space $\mathcal{H}$ embedded within $L^2[0,1]$;
more specifically, they lie in some ball in $\mathcal{H}$,

$$\|f_j^*\|_{\mathcal{H}} \le \rho, \quad j = 1, \ldots, r. \tag{7}$$

For statistical problems involving estimation of functions, the random functions might only
be observed at certain times $(t_1, \ldots, t_m)$, such as in longitudinal data analysis, or we might
collect only projections of each $x_i$ in certain directions, such as in tomographic reconstruction.
More concretely, in a time-sampling model, we observe $m$-dimensional vectors of the form

$$y_i = \big[ x_i(t_1) \ \ x_i(t_2) \ \ \cdots \ \ x_i(t_m) \big]^T + \sigma_0 w_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{8}$$

where $\{t_1, t_2, \ldots, t_m\}$ is a fixed collection of design points, and $w_i \in \mathbb{R}^m$ is a noise
vector. Another observation model is the basis truncation model, in which we observe the
projections of each $x_i$ onto the first $m$ basis functions $\{\psi_j\}_{j=1}^m$ of the kernel operator, namely

$$y_i = \big[ \langle \psi_1, x_i \rangle_{L^2} \ \ \langle \psi_2, x_i \rangle_{L^2} \ \ \cdots \ \ \langle \psi_m, x_i \rangle_{L^2} \big]^T + \sigma_0 w_i, \quad \text{for } i = 1, 2, \ldots, n, \tag{9}$$

where $\langle \cdot, \cdot \rangle_{L^2}$ represents the inner product in $L^2[0,1]$.

In order to model these and other scenarios in a unified manner, we introduce a linear operator
$\Phi_m$ that maps any function $x$ in the Hilbert space to a vector $\Phi_m(x)$ of $m$ samples, and then
consider the linear observation model

$$y_i = \Phi_m(x_i) + \sigma_m w_i, \quad \text{for } i = 1, 2, \ldots, n. \tag{10}$$

This model (10) can be viewed as a functional analog of the spiked covariance models introduced
by Johnstone [17, 18] as an analytically-convenient model for studying high-dimensional effects
in classical PCA.

Both the time-sampling (8) and frequency truncation (9) models can be represented in this
way, for appropriate choices of the operator $\Phi_m$. Recall the representation (4) of $\Phi_m$ in terms of
the functions $\{\phi_j\}_{j=1}^m$.

• For the time sampling model (8), we set $\phi_j = \mathbb{K}(\cdot, t_j)/\sqrt{m}$, so that by the reproducing
property of the kernel, we have $\langle \phi_j, f \rangle_{\mathcal{H}} = f(t_j)/\sqrt{m}$ for all $f \in \mathcal{H}$
and $j = 1, 2, \ldots, m$. With these choices, the operator $\Phi_m$ maps each $f \in \mathcal{H}$ to the
$m$-vector of rescaled samples $\frac{1}{\sqrt{m}} [ f(t_1) \ \cdots \ f(t_m) ]^T$. Defining the rescaled noise
$\sigma_m = \sigma_0/\sqrt{m}$ yields an instantiation of the model (10) which is equivalent to
time-sampling (8).

• For the basis truncation model (9), we set $\phi_j = \mu_j \psi_j$, so that the operator $\Phi$ maps each
function $f \in \mathcal{H}$ to the vector of basis coefficients
$[ \langle \psi_1, f \rangle_{L^2} \ \cdots \ \langle \psi_m, f \rangle_{L^2} ]^T$. Setting $\sigma_m = \sigma_0$ then yields
another instantiation of the model (10), this one equivalent to basis truncation (9).
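
To make the rescaled time-sampling model concrete, here is a minimal simulation sketch of models (6) and (10). Several ingredients are assumptions made only for illustration and are not fixed by the text: the design points are uniformly spaced, the components $f_j^*$ are taken to be the first $r$ Sobolev eigenfunctions from equation (3), and the noise vectors $w_i$ are standard Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
r, n, m, sigma0 = 4, 75, 100, 1.0
s = np.array([1.0, 0.5, 0.25, 0.125])              # signal strengths s_1 >= ... >= s_r
t = np.arange(1, m + 1) / m                        # uniform design points (assumption)

# Illustrative choice of f_j^*: the first r eigenfunctions from equation (3),
# which are orthonormal in L^2[0, 1].
freqs = (2 * np.arange(1, r + 1) - 1) * np.pi / 2
F = np.sqrt(2) * np.sin(np.outer(t, freqs))        # F[k, j] = f_j^*(t_k), shape (m, r)

beta = rng.standard_normal((n, r))                 # beta_ij ~ N(0, 1)
X = (beta * s) @ F.T                               # X[i, k] = x_i(t_k), model (6)

# Rescaled model (10): Phi divides samples by sqrt(m), sigma_m = sigma0 / sqrt(m).
W = rng.standard_normal((n, m))                    # Gaussian noise (assumption)
Y = X / np.sqrt(m) + (sigma0 / np.sqrt(m)) * W

Sigma_hat = Y.T @ Y / n                            # m x m sample covariance
Z_star = F / np.sqrt(m)                            # columns are z_j^* = Phi f_j^*
print(Y.shape, Sigma_hat.shape)
print(np.round(Z_star.T @ Z_star, 3))              # close to the identity matrix
```

The last line illustrates that, after sampling, the vectors $z_j^*$ are only approximately orthonormal.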

A remark on notation before proceeding: in the remainder of the paper, we use $(\Phi, \sigma)$ as
shorthand notation for $(\Phi_m, \sigma_m)$, since the index $m$ should be implicitly understood
throughout our analysis.

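In this paper, we provide and analyze estimators for the $r$-dimensional eigen-subspace spanned
by $\{f_j^*\}$, in both the sampled domain $\mathbb{R}^m$ and the functional domain. To be more specific,
for $j = 1, \ldots, r$, define the vectors $z_j^* := \Phi f_j^* \in \mathbb{R}^m$, and the subspaces

$$\mathfrak{Z}^* := \mathrm{span}\{z_1^*, \ldots, z_r^*\} \subset \mathbb{R}^m, \quad \text{and} \quad \mathfrak{F}^* := \mathrm{span}\{f_1^*, \ldots, f_r^*\} \subset \mathcal{H},$$

and let $\widehat{\mathfrak{Z}}$ and $\widehat{\mathfrak{F}}$ denote the corresponding estimators. In order to
measure the performance of the estimators, we will use projection-based distances between
subspaces. In particular, let $P_{\mathfrak{Z}^*}$ and $P_{\widehat{\mathfrak{Z}}}$ be orthogonal projection
operators into $\mathfrak{Z}^*$ and $\widehat{\mathfrak{Z}}$, respectively, considered as subspaces of
$\ell_2^m := (\mathbb{R}^m, \|\cdot\|_2)$. Similarly, let $P_{\mathfrak{F}^*}$ and $P_{\widehat{\mathfrak{F}}}$ be
orthogonal projection operators into $\mathfrak{F}^*$ and $\widehat{\mathfrak{F}}$, respectively, considered
as subspaces of $(\mathcal{H}, \|\cdot\|_{L^2})$. We are interested in bounding the deviations

$$d_{HS}(\widehat{\mathfrak{Z}}, \mathfrak{Z}^*) := |\!|\!| P_{\widehat{\mathfrak{Z}}} - P_{\mathfrak{Z}^*} |\!|\!|_{HS}, \quad \text{and} \quad d_{HS}(\widehat{\mathfrak{F}}, \mathfrak{F}^*) := |\!|\!| P_{\widehat{\mathfrak{F}}} - P_{\mathfrak{F}^*} |\!|\!|_{HS}, \tag{11}$$

where $|\!|\!|\cdot|\!|\!|_{HS}$ is the Hilbert-Schmidt norm of an operator (or matrix).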

2.3 Approximation-theoretic quantities 

One object that plays an important role in our analysis is the matrix
$K := \Phi\Phi^* \in \mathbb{R}^{m \times m}$. From the form of the adjoint, it can be seen that
$[K]_{ij} = \langle \phi_i, \phi_j \rangle_{\mathcal{H}}$. For future reference, let us compute
this matrix for the two special cases of linear operators considered thus far.

• For the time sampling model (8), we have $\phi_j = \mathbb{K}(\cdot, t_j)/\sqrt{m}$ for all $j = 1, \ldots, m$,
and hence $[K]_{ij} = \frac{1}{m} \langle \mathbb{K}(\cdot, t_i), \mathbb{K}(\cdot, t_j) \rangle_{\mathcal{H}} = \frac{1}{m} \mathbb{K}(t_i, t_j)$,
using the reproducing property of the kernel.

• For the basis truncation model (9), we have $\phi_j = \mu_j \psi_j$, and hence
$[K]_{ij} = \langle \mu_i \psi_i, \mu_j \psi_j \rangle_{\mathcal{H}} = \mu_i \delta_{ij}$. Thus, in this special case,
we have $K = \mathrm{diag}(\mu_1, \ldots, \mu_m)$.
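
The two matrices above are easy to form explicitly. The sketch below is an illustration under stated assumptions (the first-order Sobolev kernel $\min\{s,t\}$ of Example 1 and uniformly spaced design points); it builds $K$ for both sampling models and compares the leading eigenvalues of the time-sampling matrix with the kernel eigenvalues $\mu_k$.

```python
import numpy as np

m = 100
t = np.arange(1, m + 1) / m                        # uniform design points (assumption)

# Time sampling: [K]_ij = min(t_i, t_j) / m for the first-order Sobolev kernel.
K_time = np.minimum.outer(t, t) / m

# Basis truncation: K = diag(mu_1, ..., mu_m), with mu_k from equation (3).
k = np.arange(1, m + 1)
mu = ((2 * k - 1) * np.pi / 2) ** (-2.0)
K_basis = np.diag(mu)

# Eigenvalues of K under time sampling, in decreasing order.
mu_hat_time = np.sort(np.linalg.eigvalsh(K_time))[::-1]
print(mu_hat_time[:3])   # leading eigenvalues of K (time sampling)
print(mu[:3])            # leading kernel eigenvalues
```

In this example the leading eigenvalues of the time-sampling matrix track those of the kernel operator, which is one way to see how the two eigen-decays are related.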

In general, the matrix $K$ is a type of Gram matrix, and so is symmetric and positive
semidefinite. We assume throughout this paper that the functions $\{\phi_j\}_{j=1}^m$ are linearly
independent in $\mathcal{H}$, which implies that $K$ is strictly positive definite. Consequently, it has a
set of eigenvalues which can be ordered as

$$\hat\mu_1 \ge \hat\mu_2 \ge \cdots \ge \hat\mu_m > 0. \tag{12}$$

Under this condition, we may use $K$ to define a norm on $\mathbb{R}^m$ via $\|z\|_K^2 := z^T K^{-1} z$.
Moreover, we have the following interpolation lemma, which is proved in Appendix B.1.

Lemma 1. For any $f \in \mathcal{H}$, we have $\|\Phi f\|_K \le \|f\|_{\mathcal{H}}$, with equality if and only if
$f \in \mathrm{Ra}(\Phi^*)$. Moreover, for any $z \in \mathbb{R}^m$, the function $g = \Phi^* K^{-1} z$ has smallest
Hilbert norm of all functions satisfying $\Phi g = z$, and is the unique function with this property.

This lemma is useful in constructing a function-based estimator, as will be clarified in Section 3.
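
A small numerical illustration of Lemma 1 may help. This sketch assumes the time-sampling operator with the kernel $\min\{s,t\}$ and uniform design points, and uses the test function $f = \psi_1$ from equation (3), whose squared Hilbert norm is $1/\mu_1 = \pi^2/4$. It checks the inequality $\|\Phi f\|_K \le \|f\|_{\mathcal{H}}$ and the interpolation property $\Phi g = z$.

```python
import numpy as np

m = 50
t = np.arange(1, m + 1) / m                        # uniform design points (assumption)
K = np.minimum.outer(t, t) / m                     # K = Phi Phi^* for time sampling

# Test function f = psi_1 = sqrt(2) sin(pi t / 2); its squared Hilbert norm
# is the integral of f'(t)^2 over [0, 1], which equals pi^2 / 4.
z = np.sqrt(2) * np.sin(np.pi * t / 2) / np.sqrt(m)   # z = Phi f
norm_K_sq = z @ np.linalg.solve(K, z)              # ||Phi f||_K^2 = z^T K^{-1} z
norm_H_sq = np.pi ** 2 / 4

# Minimal-norm interpolant g = Phi^* K^{-1} z = sum_j a_j phi_j, with a = K^{-1} z.
a = np.linalg.solve(K, z)
Phi_g = K @ a                                      # (Phi g)_i = sum_j a_j <phi_i, phi_j>_H
print(norm_K_sq, norm_H_sq)                        # first value does not exceed the second
print(np.max(np.abs(Phi_g - z)))                   # interpolation error, near zero
```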

In our analysis of the functional error $d_{HS}(\widehat{\mathfrak{F}}, \mathfrak{F}^*)$, a number of
approximation-theoretic quantities play an important role. As a mapping from an
infinite-dimensional space $\mathcal{H}$ to $\mathbb{R}^m$, the operator $\Phi$ has a non-trivial nullspace.
Given the observation model (10), we receive no information about any component of a function
$f^*$ that lies within this nullspace. For this reason, we define the width of the nullspace in the
$L^2$-norm, namely the quantity

$$N_m(\Phi) := \sup \big\{ \|f\|_{L^2}^2 \ \big| \ f \in \mathrm{Ker}(\Phi),\ \|f\|_{\mathcal{H}} \le 1 \big\}. \tag{13}$$

In addition, the observation operator $\Phi$ induces a semi-norm on the space $\mathcal{H}$, defined by

$$\|f\|_{\Phi}^2 := \|\Phi f\|_2^2 = \sum_{j=1}^m [\Phi f]_j^2. \tag{14}$$

It is of interest to assess how well this semi-norm approximates the $L^2$-norm. Accordingly, we
define the quantity

$$D_m(\Phi) := \sup_{\substack{f \in \mathrm{Ra}(\Phi^*) \\ \|f\|_{\mathcal{H}} \le 1}} \Big| \|f\|_{\Phi}^2 - \|f\|_{L^2}^2 \Big|, \tag{15}$$

which measures the worst-case gap between these two (semi)-norms, uniformly over the Hilbert
ball of radius one, restricted to the subspace of interest $\mathrm{Ra}(\Phi^*)$. Given knowledge of the linear
operator $\Phi$, the quantity $D_m(\Phi)$ can be computed in a relatively straightforward manner. In
particular, recall the definition of the matrix $K$, and let us define a second matrix
$\Theta \in S_+^m$ with entries $\Theta_{ij} := \langle \phi_i, \phi_j \rangle_{L^2}$.

Lemma 2. We have the equivalence

$$D_m(\Phi) = |\!|\!| K - K^{-1/2}\, \Theta\, K^{-1/2} |\!|\!|_2, \tag{16}$$

where $|\!|\!|\cdot|\!|\!|_2$ denotes the $\ell_2$-operator norm.

See Appendix B.2 for the proof of this claim.
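
As an illustration of Lemma 2, the quantity $D_m(\Phi)$ can be evaluated directly from $K$ and $\Theta$. The sketch below assumes uniform time sampling with the kernel $\min\{s,t\}$; in that case $\Theta_{ij} = \frac{1}{m}\int_0^1 \min(s,t_i)\min(s,t_j)\,ds$ has an elementary closed form, which is worked out here and is not taken from the text. The function name `d_m_time_sampling` is hypothetical.

```python
import numpy as np

def d_m_time_sampling(m):
    """D_m(Phi) via Lemma 2 for uniform time sampling with kernel min(s, t)."""
    t = np.arange(1, m + 1) / m
    a = np.minimum.outer(t, t)                     # a = min(t_i, t_j)
    b = np.maximum.outer(t, t)                     # b = max(t_i, t_j)
    K = a / m
    # Theta_ij = (1/m) * integral over [0, 1] of min(s, t_i) * min(s, t_j),
    # split into the pieces s < a, a < s < b, and s > b.
    Theta = (a ** 3 / 3 + a * (b ** 2 - a ** 2) / 2 + a * b * (1 - b)) / m
    # K^{-1/2} from the eigendecomposition of the symmetric positive definite K.
    w, V = np.linalg.eigh(K)
    K_inv_half = (V / np.sqrt(w)) @ V.T
    M = K - K_inv_half @ Theta @ K_inv_half
    return np.linalg.norm(M, 2)                    # spectral (operator) norm

for m in (10, 20, 40, 80):
    print(m, d_m_time_sampling(m))
```

In this setting the printed values shrink as $m$ grows, reflecting that the sampled semi-norm approximates the $L^2$-norm better with more design points.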

3 M-estimator and implementation 

With this background in place, we now turn to the description of our M-estimator, as well as 
practical details associated with its implementation. 

3.1 M-estimator 

We begin with some preliminaries on notation, and our representation of subspaces. For each
$j = 1, \ldots, r$, define the vector $z_j^* := \Phi f_j^*$, corresponding to the image of the function $f_j^*$
under the observation operator. We let $\mathfrak{Z}^*$ denote the $r$-dimensional subspace of
$\mathbb{R}^m$ spanned by $\{z_1^*, \ldots, z_r^*\}$. Our initial goal is to construct an estimate
$\widehat{\mathfrak{Z}}$, itself an $r$-dimensional subspace, of the unknown subspace $\mathfrak{Z}^*$.

We represent subspaces by elements of the Stiefel manifold $V_r(\mathbb{R}^m)$, which consists of
$m \times r$ matrices $Z$ with orthonormal columns:

$$V_r(\mathbb{R}^m) := \{ Z \in \mathbb{R}^{m \times r} \mid Z^T Z = I_r \}.$$

A given matrix $Z$ acts as a representative of the subspace spanned by its columns, denoted by
$\mathrm{col}(Z)$. For any $U \in V_r(\mathbb{R}^r)$, the matrix $ZU$ also belongs to the Stiefel manifold, and since
$\mathrm{col}(Z) = \mathrm{col}(ZU)$, we may call $ZU$ a version of $Z$. We let $P_Z = Z Z^T \in \mathbb{R}^{m \times m}$ be
the orthogonal projection onto $\mathrm{col}(Z)$. For two matrices $Z_1, Z_2 \in V_r(\mathbb{R}^m)$, we measure the
distance between the associated subspaces via $d_{HS}(Z_1, Z_2) := |\!|\!| P_{Z_1} - P_{Z_2} |\!|\!|_{HS}$, where
$|\!|\!|\cdot|\!|\!|_{HS}$ is the Hilbert-Schmidt (or Frobenius) matrix norm.
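
The following short sketch illustrates this distance on random inputs (the dimensions and the helper name `d_hs` are illustrative choices). It confirms that $d_{HS}$ does not distinguish between $Z$ and a version $ZU$, and that the distance between two $r$-dimensional subspaces is at most $\sqrt{2r}$, since $|\!|\!|P_{Z_1} - P_{Z_2}|\!|\!|_{HS}^2 = 2r - 2\,\mathrm{tr}(P_{Z_1} P_{Z_2})$.

```python
import numpy as np

def d_hs(Z1, Z2):
    """Frobenius distance between the orthogonal projections onto col(Z1) and col(Z2)."""
    P1, P2 = Z1 @ Z1.T, Z2 @ Z2.T
    return np.linalg.norm(P1 - P2, "fro")

rng = np.random.default_rng(0)
m, r = 20, 3
Z, _ = np.linalg.qr(rng.standard_normal((m, r)))        # a point on the Stiefel manifold
U, _ = np.linalg.qr(rng.standard_normal((r, r)))        # an r x r orthogonal matrix
Z_other, _ = np.linalg.qr(rng.standard_normal((m, r)))  # a second, unrelated subspace

print(d_hs(Z, Z @ U))        # numerically zero: ZU is a version of Z
print(d_hs(Z, Z_other))      # positive, and at most sqrt(2 r)
```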

3.1.1 Subspace-based estimator 

With this notation, we now specify an M-estimator for the subspace
$\mathfrak{Z}^* = \mathrm{span}\{z_1^*, \ldots, z_r^*\}$. Let us begin with some intuition. Given the $n$ samples
$\{y_1, \ldots, y_n\}$, let us define the $m \times m$ sample covariance matrix
$\widehat{\Sigma}_n := \frac{1}{n} \sum_{i=1}^n y_i y_i^T$. Given the observation model (10), a straightforward
computation shows that

$$\mathbb{E}[\widehat{\Sigma}_n] = \sum_{j=1}^r s_j^2\, z_j^* (z_j^*)^T + \sigma_m^2 I_m. \tag{17}$$

Thus, as $n$ becomes large, we expect that the top $r$ eigenvectors of $\widehat{\Sigma}_n$ might give a good
approximation to $\mathrm{span}\{z_1^*, \ldots, z_r^*\}$. By the Courant-Fischer variational representation, these
$r$ eigenvectors can be obtained by maximizing the objective function

$$\langle\langle \widehat{\Sigma}_n, P_Z \rangle\rangle := \mathrm{tr}(\widehat{\Sigma}_n Z Z^T)$$

over all matrices $Z \in V_r(\mathbb{R}^m)$.

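However, this approach fails to take into account the smoothness constraints that the vectors
$z_j^* = \Phi f_j^*$ inherit from the smoothness of the eigenfunctions $f_j^*$. Since
$\|f_j^*\|_{\mathcal{H}} \le \rho$ by assumption, Lemma 1 implies that

$$\|z_j^*\|_K^2 = (z_j^*)^T K^{-1} z_j^* \le \|f_j^*\|_{\mathcal{H}}^2 \le \rho^2 \quad \text{for all } j = 1, 2, \ldots, r.$$

Consequently, if we define the matrix $Z^* := [z_1^* \ \cdots \ z_r^*] \in \mathbb{R}^{m \times r}$, then it must satisfy
the trace smoothness condition

$$\langle\langle K^{-1}, Z^* (Z^*)^T \rangle\rangle = \sum_{j=1}^r (z_j^*)^T K^{-1} z_j^* \le r\rho^2. \tag{18}$$

This calculation motivates the constraint $\langle\langle K^{-1}, P_Z \rangle\rangle \le 2r\rho^2$ in our estimation
procedure. Based on the preceding intuition, we are led to consider the optimization problem

$$\widehat{Z} \in \arg\max_{Z \in V_r(\mathbb{R}^m)} \Big\{ \langle\langle \widehat{\Sigma}_n, P_Z \rangle\rangle \ \Big| \ \langle\langle K^{-1}, P_Z \rangle\rangle \le 2r\rho^2 \Big\}, \tag{19}$$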

where we recall that $P_Z = Z Z^T \in \mathbb{R}^{m \times m}$. Given any optimal solution $\widehat{Z}$, we return
the subspace $\widehat{\mathfrak{Z}} = \mathrm{col}(\widehat{Z})$ as our estimate of $\mathfrak{Z}^*$. As discussed at
more length in Section 3.2, it is straightforward to compute $\widehat{Z}$ in polynomial time. The reader
might wonder why we have included an additional factor of two in this trace smoothness
condition. This slack is actually needed due to the potential infeasibility of the matrix $Z^*$ for
the program (19), which arises since the columns of $Z^*$ are not guaranteed to be orthonormal. As
shown by our analysis, the additional slack allows us to find a matrix
$\widetilde{Z}^* \in V_r(\mathbb{R}^m)$ that spans the same subspace as $Z^*$, and is also feasible for the
program (19). More formally, we have:

Lemma 3. Under condition (26b), there exists a matrix $\widetilde{Z}^* \in V_r(\mathbb{R}^m)$ such that

$$\mathrm{Ra}(\widetilde{Z}^*) = \mathrm{Ra}(Z^*), \quad \text{and} \quad \langle\langle K^{-1}, \widetilde{Z}^* (\widetilde{Z}^*)^T \rangle\rangle \le 2r\rho^2. \tag{20}$$

See Appendix B.3 for the proof of this claim.

3.1.2 The functional estimate $\widehat{\mathfrak{F}}$

Having obtained an estimate$^1$ $\widehat{\mathfrak{Z}} = \mathrm{span}\{\hat{z}_1, \ldots, \hat{z}_r\}$ of
$\mathfrak{Z}^* = \mathrm{span}\{z_1^*, \ldots, z_r^*\}$, we now need to construct an $r$-dimensional subspace
$\widehat{\mathfrak{F}}$ of the Hilbert space as an estimate of $\mathfrak{F}^* = \mathrm{span}\{f_1^*, \ldots, f_r^*\}$.
We do so using the interpolation suggested by Lemma 1. For each $j = 1, \ldots, r$, define the
function

$$\hat{f}_j := \Phi^* K^{-1} \hat{z}_j = \sum_{i=1}^m (K^{-1} \hat{z}_j)_i\, \phi_i. \tag{21}$$

Since $K = \Phi\Phi^*$ by definition, this construction ensures that $\Phi \hat{f}_j = \hat{z}_j$. Moreover,
Lemma 1 guarantees that $\hat{f}_j$ has the minimal Hilbert norm (and hence is smoothest in a certain
sense) over all functions that have this property. Finally, since $\Phi$ is assumed to be surjective
(equivalently, $K$ is assumed invertible), $\Phi^* K^{-1}$ maps linearly independent vectors to linearly
independent functions, and hence preserves dimension. Consequently, the space
$\widehat{\mathfrak{F}} := \mathrm{span}\{\hat{f}_1, \ldots, \hat{f}_r\}$ is an $r$-dimensional subspace of $\mathcal{H}$,
which we take as our estimate of $\mathfrak{F}^*$.
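
The construction in equation (21) can be sketched in code for the time-sampling model. The example assumes the kernel $\min\{s,t\}$, so that $\phi_i(t) = \min(t, t_i)/\sqrt{m}$; the helper name `functional_estimate` and the random stand-in for $\hat{z}_j$ are hypothetical and serve only to check that $\Phi \hat{f}_j = \hat{z}_j$.

```python
import numpy as np

def functional_estimate(z_hat, t_design, t_eval):
    """Evaluate f_hat = Phi^* K^{-1} z_hat on t_eval (time sampling, kernel min(s, t))."""
    m = len(t_design)
    K = np.minimum.outer(t_design, t_design) / m
    coef = np.linalg.solve(K, z_hat)                       # K^{-1} z_hat
    phi = np.minimum.outer(t_eval, t_design) / np.sqrt(m)  # phi[k, i] = phi_i(t_eval[k])
    return phi @ coef                                      # sum_i coef_i * phi_i(t)

m = 30
t_design = np.arange(1, m + 1) / m                 # uniform design points (assumption)
rng = np.random.default_rng(1)
z_hat = rng.standard_normal(m)
z_hat /= np.linalg.norm(z_hat)                     # stand-in for one column of Z_hat

# Phi f_hat has entries f_hat(t_k) / sqrt(m); these should reproduce z_hat.
f_at_design = functional_estimate(z_hat, t_design, t_design)
print(np.max(np.abs(f_at_design / np.sqrt(m) - z_hat)))

# The estimate can also be evaluated on a finer grid for plotting.
f_fine = functional_estimate(z_hat, t_design, np.linspace(0, 1, 201))
print(f_fine.shape)
```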

3.2 Implementation details 

In this section, we consider some practical aspects of implementing the M-estimator, and present
some simulations to illustrate its qualitative properties. We begin by observing that once the
subspace vectors $\{\hat{z}_j\}_{j=1}^r$ have been computed, then it is straightforward to compute the
function estimates $\{\hat{f}_j\}_{j=1}^r$ as weighted combinations of the functions $\{\phi_j\}_{j=1}^m$.
Accordingly, we focus our attention on solving the program (19).

On the surface, the problem (19) might appear non-convex, due to the Stiefel manifold
constraint. However, it can be reformulated as a semidefinite program (SDP), a well-known
class of convex programs, as clarified in the following:

Lemma 4. The problem (19) is equivalent to solving the SDP

$$\widehat{X} \in \arg\max_{X \succeq 0} \langle\langle \widehat{\Sigma}_n, X \rangle\rangle \quad \text{such that } |\!|\!|X|\!|\!|_2 \le 1,\ \mathrm{tr}(X) = r, \text{ and } \langle\langle K^{-1}, X \rangle\rangle \le 2r\rho^2, \tag{22}$$

for which there always exists an optimal rank $r$ solution. Moreover, by Lagrangian duality, for
some $\beta \ge 0$, the problem is equivalent to

$$\widehat{X} \in \arg\max_{X \succeq 0} \langle\langle \widehat{\Sigma}_n - \beta K^{-1}, X \rangle\rangle \quad \text{such that } |\!|\!|X|\!|\!|_2 \le 1 \text{ and } \mathrm{tr}(X) = r, \tag{23}$$

which can be solved by an eigendecomposition of $\widehat{\Sigma}_n - \beta K^{-1}$.

1 Here, $\{\hat{z}_j\}_{j=1}^r \subset \mathbb{R}^m$ is any collection of vectors that span $\widehat{\mathfrak{Z}}$. As we are
ultimately only interested in the resulting functional "subspace", it does not matter which
particular collection we choose.

As a consequence, for a given Lagrange multiplier $\beta$, the regularized form of the estimator can
be solved with the cost of solving an eigenvalue problem. For a given constraint $2r\rho^2$, the
appropriate value of $\beta$ can be found by a path-tracing algorithm, or a simple dyadic splitting
approach.
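
The two steps just described can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the procedure used for the experiments: the data are simulated from the time-sampling model with the kernel $\min\{s,t\}$, Gaussian noise, and components taken to be Sobolev eigenfunctions, and the search over $\beta$ is a plain bisection standing in for the dyadic splitting mentioned above. The function names are hypothetical.

```python
import numpy as np

def regularized_pca(Sigma_hat, K_inv, r, beta):
    """Top-r eigenvectors of Sigma_hat - beta * K^{-1}, as in the regularized form (23)."""
    M = Sigma_hat - beta * K_inv
    w, V = np.linalg.eigh((M + M.T) / 2)           # symmetrize against round-off
    return V[:, np.argsort(w)[::-1][:r]]           # m x r with orthonormal columns

def smoothness(Z, K_inv):
    """Trace smoothness <<K^{-1}, P_Z>> = tr(Z^T K^{-1} Z)."""
    return np.trace(Z.T @ K_inv @ Z)

def find_beta(Sigma_hat, K_inv, r, budget, max_doublings=60, iters=40):
    """Bisection sketch: a small beta whose solution satisfies the trace constraint."""
    if smoothness(regularized_pca(Sigma_hat, K_inv, r, 0.0), K_inv) <= budget:
        return 0.0                                 # constraint inactive
    lo, hi = 0.0, 1.0
    for _ in range(max_doublings):                 # grow the bracket dyadically
        if smoothness(regularized_pca(Sigma_hat, K_inv, r, hi), K_inv) <= budget:
            break
        lo, hi = hi, 2 * hi
    else:
        raise ValueError("budget appears infeasible")
    for _ in range(iters):                         # hi always satisfies the constraint
        mid = (lo + hi) / 2
        if smoothness(regularized_pca(Sigma_hat, K_inv, r, mid), K_inv) > budget:
            lo = mid
        else:
            hi = mid
    return hi

# Illustrative data (all choices below are assumptions made for this sketch).
rng = np.random.default_rng(0)
r, n, m, sigma0 = 2, 75, 40, 1.0
s = np.array([1.0, 0.5])
t = np.arange(1, m + 1) / m
freqs = (2 * np.arange(1, r + 1) - 1) * np.pi / 2
F = np.sqrt(2) * np.sin(np.outer(t, freqs))        # f_j^* = psi_j from equation (3)
Y = ((rng.standard_normal((n, r)) * s) @ F.T
     + sigma0 * rng.standard_normal((n, m))) / np.sqrt(m)
Sigma_hat = Y.T @ Y / n
K_inv = np.linalg.inv(np.minimum.outer(t, t) / m)

rho_sq = freqs[-1] ** 2                            # max_j ||psi_j||_H^2 = 1 / mu_r
budget = 2 * r * rho_sq                            # the constraint level 2 r rho^2
beta = find_beta(Sigma_hat, K_inv, r, budget)
Z_hat = regularized_pca(Sigma_hat, K_inv, r, beta)
print(beta, smoothness(Z_hat, K_inv), budget)
```

The bisection relies on the trace smoothness of the solution being non-increasing in $\beta$, which holds for penalized problems of this form.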




Figure 1: Regularized PCA for time sampling in first-order Sobolev RKHS. Top row shows, from
left to right, plots of the $r = 4$ "true" principal components $f_1^*, \ldots, f_4^*$ with signal-to-noise
ratios $s_1 = 1$, $s_2 = 0.5$, $s_3 = 0.25$ and $s_4 = 0.125$, respectively. The number of statistical and
functional samples are $n = 75$ and $m = 100$. Subsequent rows show the corresponding estimators
$\hat{f}_1, \ldots, \hat{f}_4$ obtained by applying the regularized form (23).

In order to illustrate the estimator, we consider the time sampling model (8), with uniformly
spaced samples, in the context of a first-order Sobolev RKHS (with kernel function
$\mathbb{K}(s,t) = \min(s,t)$). The parameters of the model are taken to be $r = 4$,
$(s_1, s_2, s_3, s_4) = (1, 0.5, 0.25, 0.125)$, $\sigma_0 = 1$, $m = 100$ and $n = 75$. The regularized form
(23) of the estimator is applied and the results are shown in Fig. 1. The top row corresponds to
the four "true" signals $\{f_j^*\}$, the leftmost being $f_1^*$ (i.e., having the highest signal-to-noise
ratio) and the rightmost $f_4^*$. The subsequent rows show the corresponding estimates
$\{\hat{f}_j\}$, obtained using different values of $\beta$. The second, third and fourth rows correspond
to $\beta = 0$, $\beta = 0.0052$ and $\beta = 0.83$.

One observes that without regularization ($\beta = 0$), the estimates for the two weakest signals
($f_3^*$ and $f_4^*$) are poor. The case $\beta = 0.0052$ is roughly the one which achieves the minimum
for the dual problem. One observes that the quality of the estimates of the signals, and in
particular the weakest ones, is considerably improved. The optimal (oracle) value of $\beta$, that is,
the one which achieves the minimum error between $\{f_j^*\}$ and $\{\hat{f}_j\}$, is $\beta = 0.0075$ in this
problem. The corresponding estimates are qualitatively similar to those of $\beta = 0.0052$ and are
not shown.

The case $\beta = 0.83$ shows the effect of over-regularization. It produces very smooth signals,
and although it fails to reveal two of the four components, it reveals highly accurate versions of
the other two. It is also interesting to note that the smoothest signal now occupies the position
of the second (estimated) principal component. That is, the regularized PCA sees an effective
signal-to-noise ratio which is influenced by smoothness. This suggests a rather practical appeal
of the method in revealing smooth signals embedded in noise. One can vary $\beta$ from zero upward,
and if some patterns seem to be present for a wide range of $\beta$ (and get smoother as $\beta$ is
increased), one might suspect that they are indeed present in the data but masked by noise.



4 Main results 

We now turn to the statistical analysis of our estimators, in particular deriving high-probability
upper bounds on the error of the subspace-based estimate $\widehat{\mathfrak{Z}}$, and the functional
estimate $\widehat{\mathfrak{F}}$. In both cases, we begin by stating general theorems that apply to
arbitrary linear operators $\Phi$ (Theorems 1 and 2, respectively), and then derive a number of
corollaries for particular instantiations of the observation operator.

4.1 Subspace-based estimation rates (for $\widehat{\mathfrak{Z}}$)

We begin by stating high-probability upper bounds on the error
$d_{HS}(\widehat{\mathfrak{Z}}, \mathfrak{Z}^*)$ of the subspace-based estimates. Our rates are stated in terms
of a function that involves the eigenvalues of the matrix $K = \Phi\Phi^* \in \mathbb{R}^{m \times m}$, ordered as
$\hat\mu_1 \ge \hat\mu_2 \ge \cdots \ge \hat\mu_m > 0$. Consider the function
$\mathcal{F} : \mathbb{R}_+ \to \mathbb{R}_+$ given by

$$\mathcal{F}(t) := \Big[ \sum_{j=1}^m \min\{ t^2,\ r\rho^2 \hat\mu_j \} \Big]^{1/2}. \tag{24}$$

As will be clarified in our proofs, this function provides a measure of the statistical complexity
of the function class $\mathrm{Ra}(\Phi^*) = \{ f \in \mathcal{H} \mid f = \sum_{j=1}^m a_j \phi_j \text{ for some } a \in \mathbb{R}^m \}$.
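
For a given sampling operator, the function in equation (24) is straightforward to evaluate from the eigenvalues of $K$. The sketch below does so for uniform time sampling with the kernel $\min\{s,t\}$; the values of $r$, $\rho$ and $m$ and the function name `complexity` are illustrative assumptions.

```python
import numpy as np

def complexity(t, mu_hat, r, rho):
    """F(t) = sqrt(sum_j min(t^2, r * rho^2 * mu_hat_j)), as in equation (24)."""
    return np.sqrt(np.sum(np.minimum(t ** 2, r * rho ** 2 * mu_hat)))

m, r, rho = 100, 4, 1.0                            # illustrative values
pts = np.arange(1, m + 1) / m                      # uniform design points (assumption)
mu_hat = np.linalg.eigvalsh(np.minimum.outer(pts, pts) / m)[::-1]

for t in (0.01, 0.1, 1.0):
    print(t, complexity(t, mu_hat, r, rho))
```

Two simple consequences of the definition are visible here: $\mathcal{F}(t) \le \sqrt{m}\, t$ for every $t$, and $\mathcal{F}(t) \le \rho \sqrt{r\, \mathrm{tr}(K)}$ once $t$ is large.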

We require a few regularity assumptions. Define the quantity

$$C_m(f^*) := \max_{1 \le i, j \le r} \big| \langle f_i^*, f_j^* \rangle_{\Phi} - \delta_{ij} \big| = \max_{1 \le i, j \le r} \big| \langle z_i^*, z_j^* \rangle_{\mathbb{R}^m} - \delta_{ij} \big|, \tag{25}$$

which measures the departure from orthonormality of the vectors $z_j^* := \Phi f_j^*$ in $\mathbb{R}^m$. Here
$\langle \cdot, \cdot \rangle_{\Phi}$ is the inner product associated with the semi-norm (14). A straightforward
argument using a polarization identity shows that $C_m(f^*)$ is upper bounded (up to a
constant factor) by the uniform quantity $D_m(\Phi)$, as defined in equation (15). Recall that the
random functions are generated according to the model $x_i = \sum_{j=1}^r s_j \beta_{ij} f_j^*$, where the
signal strengths are ordered as $1 = s_1 \ge s_2 \ge \cdots \ge s_r > 0$, and that $\sigma_m$ denotes the noise
standard deviation in the observation model (10).



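In terms of these quantities, we require the following assumptions:

$$\text{(A1)} \quad \frac{s_r^2}{s_1^2} \ge \frac{1}{2}, \quad \text{and} \quad \sigma_0^2 := \sup_m \sigma_m^2 \le \kappa\, s_1^2, \tag{26a}$$

$$\text{(A2)} \quad C_m(f^*) \le \frac{1}{2r}, \quad \text{and} \tag{26b}$$

$$\text{(A3)} \quad \frac{\sigma_m}{\sqrt{n}}\, \mathcal{F}(t) \le \sqrt{\kappa}\, t \quad \text{for the same constant } \kappa \text{ as in (A1).} \tag{26c}$$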



(A4) r<min \--, K ^- . (26d) 
12 4 cr m J 



Remarks: The first part of condition (A1) is to prevent the ratio $s_r/s_1$ from going to zero
as the pair $(m,n)$ increases, where the constant 1/2 is chosen for convenience. Such a lower
bound is necessary for consistent estimation of the eigen-subspace corresponding to $\{s_1, \ldots, s_r\}$.
The second part of condition (A1), involving the constant $\kappa$, provides a lower bound on the
signal-to-noise ratio $s_r/\sigma_m$. Condition (A2) is required to prevent degeneracy among the vectors
$z_j^* = \Phi f_j^*$ obtained by mapping the unknown eigenfunctions to the observation space $\mathbb{R}^m$. (In
the ideal setting, we would have $C_m(f^*) = 0$, but our analysis shows that the upper bound in
(A2) is sufficient.) Condition (A3) is required so that the critical tolerance $\epsilon_{m,n}$ specified below
is well-defined; as will be clarified, it is always satisfied for the time-sampling model, and holds
for the basis truncation model whenever $n \ge m$. Condition (A4) is easily satisfied, since the RHS
of (26d) goes to $\infty$ while we usually take $r$ to be fixed. Our results, however, hold if $r$ grows
slowly with $m$ and $n$ subject to (26d).

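Theorem 1. Under conditions (A1)-(A3) for a sufficiently small constant $\kappa$, let $\epsilon_{m,n}$ be the
smallest positive number satisfying the inequality

$$\frac{\sigma_m}{\sqrt{n}}\, r^{3/2} \mathcal{F}(\epsilon) \le \kappa \epsilon^2. \qquad (27)$$

Then there are universal positive constants $(c_0, c_1, c_2)$ such that

$$\mathbb{P}\big[ d_{HS}^2(\widehat{\mathfrak{Z}}, \mathfrak{Z}^*) \le c_0\, \epsilon_{m,n}^2 \big] \ge 1 - \varphi(n, \epsilon_{m,n}), \qquad (28)$$

where $\varphi(n, \epsilon_{m,n}) := c_1 \big\{ r^2 \exp\big( -c_2 r^{-3} \frac{n}{\sigma_m^2} (\epsilon_{m,n} \wedge \epsilon_{m,n}^2) \big) + r \exp\big( -\frac{n}{64} \big) \big\}$.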

We note that Theorem 1 is a general result, applying to an arbitrary bounded linear operator
$\Phi$. However, we can obtain a number of concrete results by making specific choices of this
sampling operator, as we explore in the following sections.
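The critical radius in (27) can be computed numerically for any given spectrum. The sketch below is one possible way to do so, assuming the eigenvalues of $K$ are available as an array; the function name `critical_radius`, the bracketing interval, and the example parameter values are hypothetical. It relies on the fact that $\mathcal{F}(\epsilon)/\epsilon^2$ is decreasing in $\epsilon$, so the set of $\epsilon$ satisfying (27) is a half-line and bisection applies.

```python
import numpy as np

def critical_radius(mu_hat, r, rho, sigma_m, n, kappa, lo=1e-12, hi=1e6, iters=200):
    """Smallest eps > 0 with (sigma_m / sqrt(n)) * r^{3/2} * F(eps) <= kappa * eps^2."""
    mu_hat = np.asarray(mu_hat, dtype=float)

    def satisfied(eps):
        F = np.sqrt(np.sum(np.minimum(eps ** 2, r * rho ** 2 * mu_hat)))
        return sigma_m / np.sqrt(n) * r ** 1.5 * F <= kappa * eps ** 2

    if satisfied(lo) or not satisfied(hi):
        raise ValueError("bracket [lo, hi] does not contain the critical radius")
    # Bisection on a logarithmic scale, since eps may span many orders of magnitude.
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if satisfied(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical example: flat spectrum, where F(eps) = sqrt(m) * eps near the solution,
# so the critical radius is sigma_m * sqrt(m) * r^{3/2} / (sqrt(n) * kappa).
eps_star = critical_radius(np.ones(10), r=1, rho=1.0, sigma_m=0.1, n=100, kappa=0.5)
```

In the flat-spectrum example the closed form is available, which makes it a convenient sanity check before applying the routine to a decaying spectrum.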

4.1.1 Consequences for time-sampling 

Let us begin with the time-sampling model (8), in which we observe the sampled functions

$$y_i = [x_i(t_1) \;\; x_i(t_2) \;\; \cdots \;\; x_i(t_m)]^T + \sigma_0 w_i, \quad \text{for } i = 1, 2, \ldots, n.$$

As noted earlier, this set-up can be modeled in our general setting (10) with $\phi_j = \mathbb{K}(\cdot, t_j)/\sqrt{m}$
and $\sigma_m = \sigma_0/\sqrt{m}$.

In this case, by the reproducing property of the RKHS, the matrix $K = \Phi\Phi^*$ has entries of
the form $K_{ij} = \langle \phi_i, \phi_j \rangle_{\mathcal{H}} = \mathbb{K}(t_i, t_j)/m$. Letting $\hat\mu_1 \ge \hat\mu_2 \ge \cdots \ge \hat\mu_m \ge 0$ denote its ordered eigenvalues,
we say that the kernel matrix $K$ has polynomial-decay with parameter $\alpha > 1/2$ if there is a
constant $c$ such that $\hat\mu_j \le c\, j^{-2\alpha}$ for all $j = 1, 2, \ldots, m$. Since the kernel matrix $K$ represents a
discretized approximation of the kernel integral operator defined by $\mathbb{K}$, this type of polynomial
decay is to be expected whenever the kernel operator has polynomial-$\alpha$ decaying eigenvalues.
For example, the usual spline kernels that define Sobolev spaces have this type of polynomial
decay [14]. In Appendix A, we verify this property explicitly for the kernel $\mathbb{K}(s,t) = \min\{s,t\}$
that defines the Sobolev class with smoothness $\alpha = 1$.
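The polynomial-decay property for the kernel $\min\{s,t\}$ can also be checked numerically. The following sketch assumes the uniform design $t_j = j/m$ (an assumption made here for illustration) and forms the matrix with entries $\min\{t_i, t_j\}/m$; the variable names are hypothetical.

```python
import numpy as np

m = 200
t = np.arange(1, m + 1) / m                 # assumed uniform sample points t_j = j/m
K = np.minimum.outer(t, t) / m              # K_ij = min{t_i, t_j} / m
mu_hat = np.sort(np.linalg.eigvalsh(K))[::-1]   # eigenvalues in decreasing order

j = np.arange(1, m + 1)
ratio = mu_hat * j ** 2                     # bounded ratio means mu_hat_j <= c * j^{-2}
```

A bounded `ratio` corresponds to decay with $\alpha = 1$. The leading eigenvalue is close to $4/\pi^2$, the top eigenvalue of the integral operator with kernel $\min\{s,t\}$ on $[0,1]$.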



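For any such kernel, we have the following consequence of Theorem 1:

Corollary 1 (Achievable rates for time-sampling). Consider the case of a time-sampling operator $\Phi$.
In addition to conditions (A1) and (A2), suppose that the kernel matrix $K$ has polynomial-decay
with parameter $\alpha > 1/2$. Then we have

$$\mathbb{P}\Big[ d_{HS}^2(\widehat{\mathfrak{Z}}, \mathfrak{Z}^*) \le c_0 \min\Big\{ \Big( \kappa_{r,\rho} \frac{\sigma_0^2}{mn} \Big)^{\frac{2\alpha}{2\alpha+1}}, \; r^3 \frac{\sigma_0^2}{n} \Big\} \Big] \ge 1 - \varphi(n,m), \qquad (29)$$

where $\kappa_{r,\rho} := r^{3 + \frac{1}{2\alpha}} \rho^{\frac{1}{\alpha}}$, and $\varphi(n,m) := c_1 \big\{ \exp\big( -c_2 \{ (r^{-2} \rho^2 mn)^{\frac{1}{2\alpha+1}} \wedge m \} \big) + \exp(-n/64) \big\}$.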

Remarks: (a) Disregarding constant pre-factors not depending on the pair $(m,n)$, Corollary 1
guarantees that solving the program (19) returns a subspace estimate $\widehat{\mathfrak{Z}}$ such that

$$d_{HS}^2(\widehat{\mathfrak{Z}}, \mathfrak{Z}^*) \lesssim \min\big\{ (mn)^{-\frac{2\alpha}{2\alpha+1}}, \; n^{-1} \big\}$$

with high probability as $(m,n)$ increase. Depending on the scaling of the number of time samples $m$
relative to the number of functional samples $n$, either term in this upper bound can be the smallest
(and hence active) one. For instance, it can be verified that whenever $m \ge n^{\frac{1}{2\alpha}}$, the first term is
smallest, so that we achieve the rate $d_{HS}^2(\widehat{\mathfrak{Z}}, \mathfrak{Z}^*) \lesssim (mn)^{-\frac{2\alpha}{2\alpha+1}}$. The appearance of the term
$(mn)^{-\frac{2\alpha}{2\alpha+1}}$ is quite natural, as it corresponds to the minimax rate of a non-parametric regression
problem with smoothness $\alpha$, based on $m$ samples each of variance $n^{-1}$. Later, in Section 4.3, we provide
results guaranteeing that this scaling is minimax optimal under reasonable conditions on the
choice of sample points (in particular, see Theorem 3(a)).

(b) To be clear, although the bound (29) allows for the possibility that the error is of order
lower than $n^{-1}$, we note that the probability with which the guarantee holds includes a term
of the order $\exp(-n/64)$. Consequently, in terms of expected error, we cannot guarantee a rate
faster than $n^{-1}$.



Proof. We need to bound the critical value $\epsilon_{m,n}$ defined in the theorem statement (27). Define
the function $\mathcal{G}^2(t) := \sum_{j=1}^m \min\{\hat\mu_j, t^2\}$, and note that $\mathcal{F}(t) = \sqrt{r}\rho\, \mathcal{G}\big( \frac{t}{\sqrt{r}\rho} \big)$ by construction.
Under the assumption of polynomial-$\alpha$ eigendecay, we have

$$\mathcal{G}^2(t) \le \int_0^\infty \min\{ c\, x^{-2\alpha}, t^2 \}\, dx,$$

and some algebra then shows that $\mathcal{G}(t) \lesssim t^{1 - 1/(2\alpha)}$. Disregarding constant factors, an upper
bound on the critical $\epsilon_{m,n}$ can be obtained by solving the equation

$$\epsilon^2 = \frac{\sigma_m}{\sqrt{n}}\, r^{3/2}\, \sqrt{r}\rho\, \Big( \frac{\epsilon}{\sqrt{r}\rho} \Big)^{1 - \frac{1}{2\alpha}}.$$

Doing so yields the upper bound $\epsilon^2 \lesssim \big[ \frac{\sigma_m^2}{n} r^3 (\sqrt{r}\rho)^{1/\alpha} \big]^{\frac{2\alpha}{2\alpha+1}}$. Otherwise, we also have the trivial
upper bound $\mathcal{F}(t) \le \sqrt{m}\, t$, which yields the alternative upper bound $\epsilon_{m,n} \lesssim \big( \frac{m \sigma_m^2}{n} r^3 \big)^{1/2}$. Recalling
that $\sigma_m = \sigma_0/\sqrt{m}$ and combining the pieces yields the claim. Notice that this last (trivial)
bound on $\mathcal{F}(t)$ implies that condition (A3) is always satisfied for the time-sampling model. □
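The algebra behind the two steps of this proof can be spelled out as follows. This is a sketch only: $x_0$ denotes the crossover point of the two branches of the minimum (a symbol introduced here, not in the paper), and constants depending only on $c$ and $\alpha$ are absorbed into $\lesssim$.

```latex
\begin{align*}
x_0 &:= (c/t^2)^{\frac{1}{2\alpha}}
   \quad\text{(the point where } c\,x^{-2\alpha} = t^2\text{)}, \\
\int_0^\infty \min\{c\,x^{-2\alpha}, t^2\}\,dx
   &= t^2 x_0 + \int_{x_0}^\infty c\,x^{-2\alpha}\,dx
    = t^2 x_0 + \frac{c\,x_0^{1-2\alpha}}{2\alpha - 1}
    = \frac{2\alpha}{2\alpha-1}\, c^{\frac{1}{2\alpha}}\, t^{2 - \frac{1}{\alpha}}, \\
\mathcal{G}(t) &\lesssim t^{1 - \frac{1}{2\alpha}},
   \qquad
   \mathcal{F}(\epsilon) = \sqrt{r}\rho\,\mathcal{G}\Big(\frac{\epsilon}{\sqrt{r}\rho}\Big)
   \lesssim (\sqrt{r}\rho)^{\frac{1}{2\alpha}}\, \epsilon^{1 - \frac{1}{2\alpha}}, \\
\epsilon^2 &= \frac{\sigma_m}{\sqrt{n}}\, r^{3/2} (\sqrt{r}\rho)^{\frac{1}{2\alpha}}\, \epsilon^{1 - \frac{1}{2\alpha}}
   \;\Longrightarrow\;
   \epsilon^{\frac{2\alpha+1}{2\alpha}} = \frac{\sigma_m}{\sqrt{n}}\, r^{3/2} (\sqrt{r}\rho)^{\frac{1}{2\alpha}}
   \;\Longrightarrow\;
   \epsilon^2 = \Big[ \frac{\sigma_m^2}{n}\, r^3 (\sqrt{r}\rho)^{\frac{1}{\alpha}} \Big]^{\frac{2\alpha}{2\alpha+1}}.
\end{align*}
```

The integral converges because $\alpha > 1/2$, which is exactly the requirement placed on the decay parameter.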

4.1.2 Consequences for basis truncation 

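We now turn to some consequences for the basis truncation model (9).

Corollary 2 (Achievable rates for basis truncation). Consider a basis truncation operator $\Phi$ in
a Hilbert space with polynomial-$\alpha$ decay. Under conditions (A1), (A2) and $m \le n$, we have

$$\mathbb{P}\Big[ d_{HS}^2(\widehat{\mathfrak{Z}}, \mathfrak{Z}^*) \le c_0 \Big( \kappa_{r,\rho} \frac{\sigma_0^2}{n} \Big)^{\frac{2\alpha}{2\alpha+1}} \Big] \ge 1 - \varphi(n,m), \qquad (30)$$

where $\kappa_{r,\rho} := r^{3 + \frac{1}{2\alpha}} \rho^{\frac{1}{\alpha}}$, and $\varphi(n,m) := c_1 \big\{ \exp\big( -c_2 (r^{-2} \rho^2 n)^{\frac{1}{2\alpha+1}} \big) + \exp(-n/64) \big\}$.

Proof. We note that as long as $m \le n$, condition (A3) is satisfied, since $\frac{\sigma_m}{\sqrt{n}} \mathcal{F}(t) \le \sigma_0 \sqrt{\frac{m}{n}}\, t \le \sigma_0 t$.
The rest of the proof follows that of Corollary 1, noting that in the last step we have $\sigma_m = \sigma_0$
for the basis truncation model. □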

4.2 Function-based estimation rates (for $\widehat{\mathfrak{F}}$)

As mentioned earlier, given the consistency of $\widehat{\mathfrak{Z}}$, the consistency of $\widehat{\mathfrak{F}}$ is closely related to
approximation properties of the semi-norm $\|\cdot\|_\Phi$ induced by $\Phi$, and in particular how closely it
approximates the $L^2$-norm. These approximation-theoretic properties are captured in part by
the nullspace width $N_m(\Phi)$ and defect $D_m(\Phi)$ defined earlier in equations (13) and (15) respectively.
In addition to these previously defined quantities, we require bounds on the following
global quantity

$$R_m(\epsilon; \nu) := \sup\big\{ \|f\|_{L^2}^2 \;\big|\; \|f\|_{\mathcal{H}}^2 \le \nu^2, \; \|f\|_\Phi^2 \le \epsilon^2 \big\}. \qquad (31)$$

A general upper bound on this quantity is of the form

$$R_m(\epsilon; \nu) \le c_1 \epsilon^2 + \nu^2 S_m(\Phi). \qquad (32)$$

In fact, it is not hard to show that such a bound exists with $c_1 = 2$ and $S_m(\Phi) = 2(D_m(\Phi) + N_m(\Phi))$
using the decomposition $\mathcal{H} = \mathrm{Ra}(\Phi^*) \oplus \mathrm{Ker}(\Phi)$. However, this bound is not sharp.
Instead, one can show that in most cases of interest, the term $S_m(\Phi)$ is of the order of $N_m(\Phi)$.

There are a variety of conditions that ensure that $S_m(\Phi)$ has this scaling; we refer the reader
to the paper [2] for a general approach. Here we provide a simple sufficient condition, namely:

(B1) $\Theta \preceq c_0 K^2$ (33)

for a positive constant $c_0$.

Lemma 5. Under (B1), the bound (32) holds with $c_1 = 2c_0$ and $S_m(\Phi) = 2N_m(\Phi)$.

See Appendix B.4 for the proof of this claim. In the sequel, we show that the first-order Sobolev
RKHS satisfies the condition (B1).

Theorem 2. Suppose that condition (Al) holds, and the approximation-theoretic quantities sat- 
isfy the bounds D m (&) < < 1 and N m (&) < 1. Then there is a constant k' t p such that 

$$d_{HS}^2(\widehat{\mathfrak{F}}, \mathfrak{F}^*) \le \kappa'_{r,\rho} \big\{ \epsilon_{m,n}^2 + S_m(\Phi) + [D_m(\Phi)]^2 \big\} \qquad (34)$$

with the same probability as in Theorem 1.

As with Theorem 1, this is a generally applicable result, stated in abstract form. By specializing
it to different sampling models, we can obtain concrete rates, as illustrated in the following
sections.

4.2.1 Consequences for time-sampling 

We begin by returning to the case of the time sampling model (8), where $\phi_j = \mathbb{K}(\cdot, t_j)/\sqrt{m}$.
In this case, condition (B1) needs to be verified by some calculations. For instance, as shown
in Appendix A, in the case of the Sobolev kernel with smoothness $\alpha = 1$ (namely, $\mathbb{K}(s,t) = \min\{s,t\}$),
we are guaranteed that (B1) holds with $c_0 = 1$, whenever the samples $\{t_j\}$ are chosen
uniformly over $[0,1]$; hence, by Lemma 5, $S_m(\Phi) = 2N_m(\Phi)$. Moreover, in the case of uniform
sampling, we expect that the nullspace width $N_m(\Phi)$ is upper bounded by $\mu_{m+1}$, so will be
proportional to $m^{-2\alpha}$ in the case of a kernel operator with polynomial-$\alpha$ decay. This is verified
in [2] (up to a logarithmic factor) for the case of the first-order Sobolev kernel. In Appendix A,
we also show that, for this kernel, $[D_m(\Phi)]^2$ is of the order $m^{-2\alpha}$, that is, of the same order as
$N_m(\Phi)$.

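Corollary 3. Consider the time-sampling model (8) with uniformly spaced samples, and assume
condition (B1) holds and that $N_m(\Phi) + [D_m(\Phi)]^2 \lesssim m^{-2\alpha}$. Then the M-estimator returns
a subspace estimate $\widehat{\mathfrak{F}}$ such that

$$d_{HS}^2(\widehat{\mathfrak{F}}, \mathfrak{F}^*) \le \kappa'_{r,\rho} \Big\{ \min\Big\{ \Big( \frac{\sigma_0^2}{mn} \Big)^{\frac{2\alpha}{2\alpha+1}}, \; \frac{\sigma_0^2}{n} \Big\} + \frac{1}{m^{2\alpha}} \Big\} \qquad (35)$$

with the same probability as in Corollary 1.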

In this case, there is an interesting trade-off between the bias or approximation error term,
which is of order $m^{-2\alpha}$, and the estimation error. An interesting transition occurs at the point
when $m \gtrsim n^{\frac{1}{2\alpha}}$, at which:

• the bias term $m^{-2\alpha}$ becomes of the order $n^{-1}$, so that it is no longer dominant, and

• for the two terms in the estimation error, we have the ordering

$$(mn)^{-\frac{2\alpha}{2\alpha+1}} \le \big( n^{1 + \frac{1}{2\alpha}} \big)^{-\frac{2\alpha}{2\alpha+1}} = n^{-1}.$$

Consequently, we conclude that the scaling $m = n^{\frac{1}{2\alpha}}$ is the minimal number of samples such
that we achieve an overall bound of the order $n^{-1}$ in the time-sampling model. In Section 4.3,
we will see that these rates are minimax-optimal.
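The transition point can be checked with simple arithmetic. The sketch below evaluates the estimation and bias terms of the time-sampling bound (35) with all constants set to one; the function name `time_sampling_terms` and the chosen values of $\alpha$ and $n$ are illustrative assumptions.

```python
def time_sampling_terms(m, n, alpha, sigma0=1.0):
    """Return (estimation term, bias term) of the time-sampling bound, constants dropped."""
    est = min((sigma0 ** 2 / (m * n)) ** (2 * alpha / (2 * alpha + 1)), sigma0 ** 2 / n)
    bias = m ** (-2 * alpha)
    return est, bias

alpha, n = 1.0, 10_000
m_star = n ** (1 / (2 * alpha))          # transition point n^{1/(2 alpha)}, here 100
est_star, bias_star = time_sampling_terms(m_star, n, alpha)

# Below the transition the bias dominates; above it the overall bound is of order 1/n.
est_low, bias_low = time_sampling_terms(10, n, alpha)
est_high, bias_high = time_sampling_terms(1000, n, alpha)
```

At `m_star` both terms are of order $1/n$; for smaller `m` the bias term is larger than $1/n$, and for larger `m` the sum of the two terms stays below $2/n$.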
4.2.2 Consequences for basis truncation

For the basis truncation operator $\Phi$, we have $\Theta = K^2 = \mathrm{diag}(\mu_1^2, \ldots, \mu_m^2)$, so that condition (B1)
is satisfied trivially with $c_0 = 1$. Moreover, Lemma 2 implies $D_m(\Phi) = 0$. In addition, a function
$f = \sum_{j=1}^\infty \sqrt{\mu_j}\, a_j \psi_j$ satisfies $\Phi f = 0$ if and only if $a_1 = a_2 = \cdots = a_m = 0$, so that

$$N_m(\Phi) = \sup\big\{ \|f\|_{L^2}^2 \;\big|\; \|f\|_{\mathcal{H}} \le 1, \; \Phi f = 0 \big\} = \mu_{m+1}.$$

Consequently, we obtain the following corollary of Theorem 2:

Corollary 4. Consider the basis truncation model (9) with a kernel operator that has polynomial-$\alpha$
decaying eigenvalues. Then the M-estimator returns a function subspace estimate $\widehat{\mathfrak{F}}$ such that

$$d_{HS}^2(\widehat{\mathfrak{F}}, \mathfrak{F}^*) \le \kappa'_{r,\rho} \Big\{ \Big( \frac{\sigma_0^2}{n} \Big)^{\frac{2\alpha}{2\alpha+1}} + \frac{1}{m^{2\alpha}} \Big\} \qquad (36)$$

with the same probability as in Corollary 2.

By comparison to Corollary 3, we see that the trade-offs between $(m, n)$ are very different
for basis truncation. In particular, there is no interaction between the number of functional
samples $m$ and the number of statistical samples $n$. Increasing $m$ only reduces the approximation
error, whereas increasing $n$ only reduces the estimation error. Moreover, in contrast to the time
sampling model of Corollary 3, it is impossible to achieve the fast rate $n^{-1}$, regardless of how
we choose the pair $(m, n)$. In Section 4.3, we will also see that the rates given in Corollary 4 are
minimax optimal.

4.3 Lower bounds 

We now turn to lower bounds on the minimax risk, demonstrating the sharpness of our achievable
results in terms of their scaling with $(m,n)$. In order to do so, it suffices to consider the simple
model with a single functional component $f^* \in \mathbb{B}_{\mathcal{H}}(1)$, so that we observe $y_i = \beta_{i1} \Phi(f^*) + \sigma_m w_i$
for $i = 1, 2, \ldots, n$, where $\beta_{i1} \sim N(0,1)$ are i.i.d. standard normal variates. The minimax risk
over the unit ball of the function space $\mathcal{H}$ in the $\Phi$-norm is given by

$$\mathfrak{M}_{m,n}^{\mathcal{H}}(\|\cdot\|_\Phi) := \inf_{\hat f} \sup_{f^* \in \mathbb{B}_{\mathcal{H}}(1)} \mathbb{E} \|\hat f - f^*\|_\Phi^2, \qquad (37)$$

where the function $f^*$ ranges over the unit ball $\mathbb{B}_{\mathcal{H}}(1) = \{ f \in \mathcal{H} \mid \|f\|_{\mathcal{H}} \le 1 \}$ of some Hilbert
space, and $\hat f$ ranges over measurable functions of the data matrix $(y_1, y_2, \ldots, y_n) \in \mathbb{R}^{m \times n}$.

Theorem 3 (Lower bounds for $\|\hat f - f^*\|_\Phi$). Suppose that the kernel matrix $K$ has eigenvalues
with polynomial-$\alpha$ decay and (A1) holds.

(a) For the time-sampling model, we have

$$\mathfrak{M}_{m,n}^{\mathcal{H}}(\|\cdot\|_\Phi) \ge C \min\Big\{ \Big( \frac{\sigma_0^2}{mn} \Big)^{\frac{2\alpha}{2\alpha+1}}, \; \frac{\sigma_0^2}{n} \Big\}. \qquad (38)$$

(b) For the frequency-truncation model, with $m \ge (c_0 n)^{\frac{1}{2\alpha+1}}$, we have:

$$\mathfrak{M}_{m,n}^{\mathcal{H}}(\|\cdot\|_\Phi) \ge C \Big( \frac{\sigma_0^2}{n} \Big)^{\frac{2\alpha}{2\alpha+1}}. \qquad (39)$$

Note that part (a) of Theorem 3 shows that the rates obtained in Corollary 1 for the case of
time-sampling are minimax optimal. Similarly, comparing part (b) of the theorem to Corollary 2,
we conclude that the rates obtained for the frequency truncation model are minimax optimal for
$n \in [m, c_1 m^{2\alpha+1}]$. As will become clear momentarily (as a consequence of our next theorem),
the case $n > c_1 m^{2\alpha+1}$ is not of practical interest.

We now turn to lower bounds on the minimax risk in the $\|\cdot\|_{L^2}$ norm, namely

$$\mathfrak{M}_{m,n}^{\mathcal{H}}(\|\cdot\|_{L^2}) := \inf_{\hat f} \sup_{f^* \in \mathbb{B}_{\mathcal{H}}(1)} \mathbb{E} \|\hat f - f^*\|_{L^2}^2. \qquad (40)$$

Obtaining lower bounds on this minimax risk requires another approximation property of the
norm $\|\cdot\|_\Phi$ relative to $\|\cdot\|_{L^2}$. Recall the matrix $\Psi \in \mathbb{R}^{m \times m}$ with entries $\Psi_{ij} := \langle \psi_i, \psi_j \rangle_\Phi$.
Since the eigenfunctions are orthogonal in $L^2$, the deviation of $\Psi$ from the identity measures
how well the inner product defined by $\Phi$ approximates the $L^2$-inner product over the first $m$
eigenfunctions of the kernel operator. For proving lower bounds, we require an upper bound of
the form

(B2) $\lambda_{\max}(\Psi) \le c_1$,

for some universal constant $c_1 > 0$. As the proof will clarify, this upper bound is necessary in
order that the Kullback-Leibler divergence (which controls the relative discriminability between
different models) can be upper bounded in terms of the $L^2$-norm.

Theorem 4 (Lower bounds for $\|\hat f - f^*\|_{L^2}^2$). Suppose that condition (B2) holds, and the operator
associated with the kernel function $\mathbb{K}$ of the reproducing kernel Hilbert space $\mathcal{H}$ has eigenvalues with
polynomial-$\alpha$ decay.

(a) For the time-sampling model, the minimax risk is lower bounded as

$$\mathfrak{M}_{m,n}^{\mathcal{H}}(\|\cdot\|_{L^2}) \ge C \Big\{ \min\Big\{ \Big( \frac{\sigma_0^2}{mn} \Big)^{\frac{2\alpha}{2\alpha+1}}, \; \frac{\sigma_0^2}{n} \Big\} + \Big( \frac{1}{m} \Big)^{2\alpha} \Big\}. \qquad (41)$$

(b) For the frequency-truncation model, the minimax error is lower bounded as

$$\mathfrak{M}_{m,n}^{\mathcal{H}}(\|\cdot\|_{L^2}) \ge C \Big\{ \Big( \frac{\sigma_0^2}{n} \Big)^{\frac{2\alpha}{2\alpha+1}} + \Big( \frac{1}{m} \Big)^{2\alpha} \Big\}. \qquad (42)$$

Verifying condition (B2) requires, in general, some calculations in the case of the time-sampling
model. It is verified for uniform time-sampling for the first-order Sobolev RKHS in Appendix A.
For the frequency-truncation model, condition (B2) always holds trivially since $\Psi = I_m$. By this
theorem, the $L^2$ convergence rates of Corollaries 3 and 4 are minimax optimal. Also note that
due to the presence of the approximation term $m^{-2\alpha}$ in (42), the $\Phi$-norm term $n^{-\frac{2\alpha}{2\alpha+1}}$ is only
dominant when $m \ge c_2 n^{\frac{1}{2\alpha+1}}$, implying that this is the interesting regime for Theorem 3(b).

5 Proof of subspace-based rates 

We now turn to the proofs of the results involving the error $d_{HS}(\widehat{\mathfrak{Z}}, \mathfrak{Z}^*)$ between the estimated $\widehat{\mathfrak{Z}}$
and true subspace $\mathfrak{Z}^*$. We begin by proving Theorem 1, and then turn to its corollaries.

5.1 Preliminaries

We begin with some preliminaries before proceeding to the heart of the proof. Let us first
introduce some convenient notation. Consider the $n \times m$ matrices

$$Y := [y_1 \; y_2 \; \cdots \; y_n]^T, \quad \text{and} \quad W := [w_1 \; w_2 \; \cdots \; w_n]^T,$$

corresponding to the observation matrix $Y$ and noise matrix $W$ respectively. In addition,
we define the matrices $B := (\beta_{ij}) \in \mathbb{R}^{n \times r}$ and $S := \mathrm{diag}(s_1, \ldots, s_r) \in \mathbb{R}^{r \times r}$. Recalling
that $Z^* := (z_1^*, \ldots, z_r^*) \in \mathbb{R}^{m \times r}$, the observation model (10) can be written in the matrix form
$Y = B(Z^* S)^T + \sigma_m W$. Moreover, let us define the matrices $\bar B := \frac{B^T B}{n} \in \mathbb{R}^{r \times r}$ and
$\bar W := \frac{W^T B}{n} \in \mathbb{R}^{m \times r}$. Using this notation, some algebra shows that the associated sample covariance
$\widehat\Sigma_n := \frac{1}{n} Y^T Y$ can be written in the form

$$\widehat\Sigma_n = \underbrace{Z^* S \bar B S (Z^*)^T}_{\Gamma} + \Delta_1 + \Delta_2, \qquad (43)$$

where $\Delta_1 := \sigma_m [\bar W S (Z^*)^T + Z^* S \bar W^T]$ and $\Delta_2 := \sigma_m^2 \frac{W^T W}{n}$.

Lemma 3, proved in Appendix B.3, shows the existence of a matrix $\tilde Z^* \in V_r(\mathbb{R}^m)$ such that
$\mathrm{Ra}(\tilde Z^*) = \mathrm{Ra}(Z^*)$. As discussed earlier, due to the nature of the Stiefel manifold, there are
many versions of this matrix $\tilde Z^*$, and also of any optimal solution matrix $\widehat Z$, obtained via right
multiplication with an orthogonal matrix. For the subsequent arguments, we need to work with
a particular version of $\tilde Z^*$ (and $\widehat Z$) that we describe here.

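Now let us fix some convenient versions of $\tilde Z^*$ and $\widehat Z$. As a consequence of the CS decomposition,
as long as $r \le m/2$, there exist orthogonal matrices $U, V \in \mathbb{R}^{r \times r}$ and an orthogonal matrix
$Q \in \mathbb{R}^{m \times m}$ such that

$$Q^T \tilde Z^* U = \begin{pmatrix} I_r \\ 0 \\ 0 \end{pmatrix}, \quad \text{and} \quad Q^T \widehat Z V = \begin{pmatrix} \widehat C \\ \widehat S \\ 0 \end{pmatrix}, \qquad (44)$$

where $\widehat C = \mathrm{diag}(\hat c_1, \ldots, \hat c_r)$ and $\widehat S = \mathrm{diag}(\hat s_1, \ldots, \hat s_r)$ such that $1 \ge \hat s_1 \ge \cdots \ge \hat s_r \ge 0$ and
$\widehat C^2 + \widehat S^2 = I_r$. (See Bhatia [5], Theorem VII.1.8, for details on this decomposition.) In the
analysis to follow, we work with $\tilde Z^* U$ and $\widehat Z V$ instead of $\tilde Z^*$ and $\widehat Z$. To avoid extra notation,
from now on, we will use $\tilde Z^*$ and $\widehat Z$ for these new versions, which we refer to as properly aligned.
With this choice, we may assume $U = V = I_r$ in the CS decomposition (44).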

The following lemma isolates some useful properties of properly aligned subspaces: 

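Lemma 6. Let $\tilde Z^*$ and $\widehat Z$ be properly aligned, and define the matrices

$$\widehat P := P_{\widehat Z} - P_{\tilde Z^*} = \widehat Z \widehat Z^T - \tilde Z^* (\tilde Z^*)^T, \quad \text{and} \quad \widehat E := \widehat Z - \tilde Z^*. \qquad (45)$$

In terms of the CS decomposition (44), we have

$$|||\widehat E|||_{HS} \le |||\widehat P|||_{HS}, \qquad (46a)$$
$$(\tilde Z^*)^T (P_{\tilde Z^*} - P_{\widehat Z}) \tilde Z^* = \widehat S^2, \quad \text{and} \qquad (46b)$$
$$d_{HS}^2(\widehat Z, \tilde Z^*) = |||P_{\tilde Z^*} - P_{\widehat Z}|||_{HS}^2 = 2 |||\widehat S^2|||_{HS}^2 + 2 |||\widehat C \widehat S|||_{HS}^2 = 2 \sum_{k=1}^r \hat s_k^2 (\hat s_k^2 + \hat c_k^2) = 2\, \mathrm{tr}(\widehat S^2). \qquad (46c)$$

Proof. From the CS decomposition (44), we have

$$\tilde Z^* (\tilde Z^*)^T - \widehat Z \widehat Z^T = Q \begin{pmatrix} \widehat S^2 & -\widehat C \widehat S & 0 \\ -\widehat S \widehat C & -\widehat S^2 & 0 \\ 0 & 0 & 0 \end{pmatrix} Q^T,$$

from which relations (46b) and (46c) follow. From the decomposition (44) and the proper alignment
condition $U = V = I_r$, we have

$$|||\widehat E|||_{HS}^2 = |||Q^T (\widehat Z - \tilde Z^*)|||_{HS}^2 = |||I_r - \widehat C|||_{HS}^2 + |||\widehat S|||_{HS}^2 = 2 \sum_{i=1}^r (1 - \hat c_i) \le 2 \sum_{i=1}^r (1 - \hat c_i^2) = 2 \sum_{i=1}^r \hat s_i^2 = |||\widehat P|||_{HS}^2, \qquad (47)$$

where we have used the relations $\widehat C^2 + \widehat S^2 = I_r$, $\hat c_i \in [0,1]$, and $2\, \mathrm{tr}(\widehat S^2) = |||P_{\tilde Z^*} - P_{\widehat Z}|||_{HS}^2$.

□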
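The identities of Lemma 6 can be verified numerically on random subspaces. The sketch below is an illustration, not part of the paper: it uses the singular value decomposition of $(Z^*)^T \widehat Z$ to obtain the cosines of the principal angles, which is one standard way to realize the alignment; the dimensions and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r = 12, 3
Z_star, _ = np.linalg.qr(rng.standard_normal((m, r)))   # orthonormal basis, stands in for Z*
Z_hat, _ = np.linalg.qr(rng.standard_normal((m, r)))    # orthonormal basis, stands in for Z-hat

# SVD of (Z*)^T Z-hat = U diag(c) V^T gives the cosines c_i of the principal angles.
U, c, Vt = np.linalg.svd(Z_star.T @ Z_hat)
c = np.clip(c, 0.0, 1.0)
s2 = 1.0 - c ** 2                                        # squared sines, diagonal of S^2

# Properly aligned versions: (Z* U)^T (Z-hat V) = diag(c).
Zs, Zh = Z_star @ U, Z_hat @ Vt.T
P = Zh @ Zh.T - Zs @ Zs.T                                # difference of projections
E = Zh - Zs                                              # difference of aligned bases

P_hs2 = np.linalg.norm(P, "fro") ** 2
E_hs2 = np.linalg.norm(E, "fro") ** 2
```

Here `P_hs2` should equal `2 * s2.sum()` as in (46c), and `E_hs2` should equal `2 * (1 - c).sum()`, which is at most `P_hs2` as in (46a).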



5.2 Proof of Theorem 1

Using the notation introduced in Lemma 6, our goal is to bound the Hilbert-Schmidt norm
$|||\widehat P|||_{HS}$. Without loss of generality we will assume $s_1 = 1$ throughout. Recalling the definition (43)
of the random matrix $\Delta$, the following inequality plays a central role in the proof:

Lemma 7. Under condition (A1) and $s_1 = 1$, we have

$$|||\widehat P|||_{HS}^2 \le 128\, \langle\langle \widehat P, \, \Delta_1 + \Delta_2 \rangle\rangle \qquad (48)$$

with probability at least $1 - \exp(-n/32)$.

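Proof. We use the shorthand notation $\Delta = \Delta_1 + \Delta_2$ for the proof. Since $\tilde Z^*$ is feasible and $\widehat Z$ is
optimal for the program (19), we have the basic inequality $\langle\langle \widehat\Sigma_n, P_{\tilde Z^*} \rangle\rangle \le \langle\langle \widehat\Sigma_n, P_{\widehat Z} \rangle\rangle$. Using the
decomposition $\widehat\Sigma_n = \Gamma + \Delta$ and rearranging yields the inequality

$$\langle\langle \Gamma, \, P_{\tilde Z^*} - P_{\widehat Z} \rangle\rangle \le \langle\langle \Delta, \, P_{\widehat Z} - P_{\tilde Z^*} \rangle\rangle. \qquad (49)$$

From the definition (43) of $\Gamma$ and $Z^* = \tilde Z^* R$, the left-hand side of the inequality (49) can be
lower bounded as

$$\langle\langle \Gamma, \, P_{\tilde Z^*} - P_{\widehat Z} \rangle\rangle = \langle\langle \bar B, \, S R^T (\tilde Z^*)^T (P_{\tilde Z^*} - P_{\widehat Z}) \tilde Z^* R S \rangle\rangle = \mathrm{tr}\big( \bar B S R^T \widehat S^2 R S \big) \ge \lambda_{\min}(\bar B)\, \lambda_{\min}(S^2)\, \lambda_{\min}(R^T R)\, \mathrm{tr}(\widehat S^2),$$

where we have used (89) and (90) of the appendix several times. We note that $\lambda_{\min}(S^2) = s_r^2 \ge \frac{1}{2}$
and $\lambda_{\min}(R^T R) \ge \frac{1}{2}$ provided $r C_m(f^*) \le \frac{1}{2}$. To bound the minimum eigenvalue
of $\bar B$, let $\gamma_{\min}(B)$ denote the minimum singular value of the $n \times r$ Gaussian matrix $B$. The
following concentration inequality is well-known (cf. [11, 20]):

$$\mathbb{P}\big[ \gamma_{\min}(B) \le \sqrt{n} - \sqrt{r} - t \big] \le \exp(-t^2/2), \quad \text{for all } t > 0.$$

Since $\lambda_{\min}(\bar B) = \gamma_{\min}^2(B/\sqrt{n})$, we have that $\lambda_{\min}(\bar B) \ge (1 - \sqrt{r/n} - t)^2$ with probability at least
$1 - \exp(-n t^2/2)$. Assuming $r/n \le \frac{1}{4}$ and setting $t = \frac{1}{4}$, we get $\lambda_{\min}(\bar B) \ge \frac{1}{16}$ with probability at
least $1 - \exp(-n/32)$. Putting the pieces together yields the claim. □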
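The concentration inequality for the smallest singular value of a Gaussian matrix can be illustrated by simulation. The sketch below is a sanity check under assumed values of $n$, $r$, $t$ and the number of trials (all chosen here for illustration); it counts how often the smallest singular value falls below $\sqrt{n} - \sqrt{r} - t$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, t = 400, 5, 4.0
threshold = np.sqrt(n) - np.sqrt(r) - t
trials = 200

# Smallest singular value of an n x r standard Gaussian matrix, over repeated draws.
gamma_min = np.array([
    np.linalg.svd(rng.standard_normal((n, r)), compute_uv=False).min()
    for _ in range(trials)
])
violations = int(np.sum(gamma_min <= threshold))

# Corresponding minimum eigenvalue of B^T B / n, as used in the proof of Lemma 7.
lam_min = gamma_min ** 2 / n
```

With $t = 4$ the bound $\exp(-t^2/2)$ is about $3.4 \times 10^{-4}$ per draw, so violations should be extremely rare, and `lam_min` stays well above the value $1/16$ used in the proof since $r/n \le 1/4$ here.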
The inequality (48) reduces the problem of bounding $|||\widehat P|||_{HS}^2$ to the sub-problem of studying
the random variable $\langle\langle \widehat P, \Delta_1 + \Delta_2 \rangle\rangle$. Based on Lemma 7, our next step is to establish an
inequality (holding with high probability) of the form

$$\langle\langle \widehat P, \, \Delta_1 + \Delta_2 \rangle\rangle \le c_1 \Big\{ \frac{\sigma_m}{\sqrt{n}}\, r^{3/2} \mathcal{F}(|||\widehat E|||_{HS}) + \kappa |||\widehat E|||_{HS}^2 + \kappa \epsilon_{m,n}^2 \Big\}, \qquad (50)$$

where $c_1$ is some universal constant, $\kappa$ is the constant in condition (A1), and $\epsilon_{m,n}$ is the critical
radius from Theorem 1. Doing so is a non-trivial task: both matrices $\widehat P$ and $\Delta$ are random and
depend on one another, since the subspace $\widehat Z$ was obtained by optimizing a random function
depending on $\Delta$. Consequently, our proof of the bound (50) involves deriving a uniform law of
large numbers for a certain matrix class.

Suppose that the bound (50) holds, and that the subspaces $\tilde Z^*$ and $\widehat Z$ are properly aligned.
Lemma 6 implies that $|||\widehat E|||_{HS} \le |||\widehat P|||_{HS}$, and since $\mathcal{F}$ is a non-decreasing function, the inequality (50)
combined with Lemma 7 implies that

$$(1 - 128 \kappa c_1)\, |||\widehat P|||_{HS}^2 \le 128\, c_1 \Big\{ \frac{\sigma_m}{\sqrt{n}}\, r^{3/2} \mathcal{F}(|||\widehat P|||_{HS}) + \kappa \epsilon_{m,n}^2 \Big\},$$

from which the claim follows as long as $\kappa$ is suitably small.
Accordingly, in order to complete the proof of Theorem 1, it remains to prove the bound (50),
and the remainder of our work is devoted to this goal. Given the linearity of trace, we can bound
the terms $\langle\langle \widehat P, \Delta_1 \rangle\rangle$ and $\langle\langle \widehat P, \Delta_2 \rangle\rangle$ separately.

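5.2.1 Bounding $\langle\langle \widehat P, \Delta_1 \rangle\rangle$

Let $\{\hat z_j\}$, $\{\tilde z_j^*\}$, $\{\hat e_j\}$ and $\{\bar w_j\}$ denote the columns of $\widehat Z$, $\tilde Z^*$, $\widehat E$ and $\bar W$, respectively,
where we recall the definitions of these quantities from equation (43) and Lemma 6. Note
that $\bar w_j = n^{-1} \sum_{i=1}^n w_i \beta_{ij}$. In Appendix C.1, we show that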

{{P, AO) < V6ar 3 / 2 mzx\(w k ,ej)\ + J^r\lE\f HS max|(%,2£)|. (51) 

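Consequently, we need to obtain bounds on quantities of the form $|\langle \bar w_j, v \rangle|$, where the vector $v$ is
either fixed (e.g., $v = \tilde z_j^*$) or random (e.g., $v = \hat e_j$). The following lemmas provide us with the
requisite bounds:

Lemma 8. We have

$$\max_{j,k}\, \sigma_m r^{3/2} |\langle \bar w_k, \hat e_j \rangle| \le C \Big\{ \frac{\sigma_m}{\sqrt{n}}\, r^{3/2} \mathcal{F}(|||\widehat E|||_{HS}) + \kappa |||\widehat E|||_{HS}^2 + \kappa \epsilon_{m,n}^2 \Big\}$$

with probability at least $1 - c_1 r \exp\big( -\kappa^2 r^{-3} n \frac{\epsilon_{m,n}^2}{2\sigma_m^2} \big) - r \exp(-n/64)$.

Lemma 9. We have

$$\mathbb{P}\big[ \max_{j,k}\, \sigma_m r |\bar w_k^T \tilde z_j^*| \le \sqrt{6}\, \kappa \big] \ge 1 - r^2 \exp\big( -\kappa^2 r^{-2} n / (2\sigma_m^2) \big).$$

See Appendix C.2 and C.3, respectively, for the proofs of these claims.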






5.2.2 Bounding $\langle\langle \widehat P, \Delta_2 \rangle\rangle$

Recalling the definition (43) of $\Delta_2$ and using linearity of the trace, we obtain
« P < A ^» = E {&) T W T W- Zj - {~z*) T W T Wz*). 

3=1 

Since $\hat e_j = \hat z_j - \tilde z_j^*$, we have

«P, A 2 » = a 2 j2{2{T 3 ) T Q-W T W-I r )e J + + 2(2*f%} 

<^E {2 gfQ^V - ^ ^-WeAl }, (52) 

Zi T 2 (e,0 

where we have used the fact that $2 \sum_j (\tilde z_j^*)^T \hat e_j = 2 \sum_j [(\tilde z_j^*)^T \hat z_j - 1] = 2 \sum_j (\hat c_j - 1) = -|||\widehat E|||_{HS}^2 \le 0$.
The following lemmas provide high probability bounds on the terms $T_1$ and $T_2$.

Lemma 10. We have the upper bound 



r 




with probability 1 — C2exp(— K 2 r 2 n ^'i^™'™ ) ~~ ?*exp(— n/64). 

Lemma 11. We have the upper bound a 2 Ej=i ^~2(ej) < Ck{ + e^} u>ii/i probability at 
least 1 — C3 exp(— n 2 r~ 2 n n /2a 2 ). 

See Appendices C.4 and C.5, respectively, for the proofs of these claims.



6 Proof of functional rates 

We now turn to the proof of Theorem 2, which provides upper bounds on the estimation error
in the function domain. As in the proof of Theorem 1, let $\widehat Z = (\hat z_1, \ldots, \hat z_r) \in V_r(\mathbb{R}^m)$ and
$\tilde Z^* = (\tilde z_1^*, \ldots, \tilde z_r^*) \in V_r(\mathbb{R}^m)$ represent the subspaces $\widehat{\mathfrak{Z}}$ and $\mathfrak{Z}^*$ respectively, and assume that they
are properly aligned (see Lemma 6). For $j = 1, \ldots, r$, define $\hat g_j := \Phi^* K^{-1} \hat z_j$ and $g_j^* := \Phi^* K^{-1} \tilde z_j^*$.

Let $\{\hat h_j\}_{j=1}^r$ be any basis of $\widehat{\mathfrak{F}}$, orthonormal in $L^2$, and similarly, let $\{h_j^*\}_{j=1}^r$ be any orthonormal
basis of $\mathfrak{F}^*$. Our goal is to bound the Hilbert-Schmidt norm $|||P_{\widehat{\mathfrak{F}}} - P_{\mathfrak{F}^*}|||_{HS}$. In order to do so, we
first observe that

$$|||P_{\widehat{\mathfrak{F}}} - P_{\mathfrak{F}^*}|||_{HS}^2 \le 2 \sum_{j=1}^r \|\hat h_j - h_j^*\|_{L^2}^2, \qquad (53)$$

so that it suffices to upper bound $\sum_{j=1}^r \|\hat h_j - h_j^*\|_{L^2}^2$. We relate this quantity to the functions $\hat g_j$
and $g_j^*$ via the elementary inequality

$$\|\hat h_j - h_j^*\|_{L^2}^2 \le 4 \big\{ \|\hat g_j - g_j^*\|_{L^2}^2 + \|\hat h_j - \hat g_j\|_{L^2}^2 + \|g_j^* - h_j^*\|_{L^2}^2 \big\}. \qquad (54)$$

The remainder of our proof is focused on obtaining suitable upper bounds on each of these three
terms.



We begin by bounding the first term $\|\hat g_j - g_j^*\|_{L^2}$. Recall the definitions of $R_m(\epsilon; \nu)$ and
$S_m(\Phi)$ and their relation via inequality (32). We exploit the inequality in the following way:
suppose that we can show that

$$\sum_{j=1}^r \|\hat g_j - g_j^*\|_\Phi^2 \le A^2, \quad \text{and} \quad \sum_{j=1}^r \|\hat g_j - g_j^*\|_{\mathcal{H}}^2 \le B^2. \qquad (55)$$

Let $\mathcal{S}(A, B) = \{ (a, b) \in \mathbb{R}^r \times \mathbb{R}^r \mid \sum_{j=1}^r a_j^2 \le A^2, \; \sum_{j=1}^r b_j^2 \le B^2 \}$. We may then conclude that

$$\sum_{j=1}^r \|\hat g_j - g_j^*\|_{L^2}^2 \le \sup_{(a,b) \in \mathcal{S}(A,B)} \sum_{j=1}^r R_m(a_j; b_j) \overset{(i)}{\le} \sup_{(a,b) \in \mathcal{S}(A,B)} \sum_{j=1}^r \big\{ c_1 a_j^2 + b_j^2 S_m(\Phi) \big\} = c_1 A^2 + B^2 S_m(\Phi), \qquad (56)$$

where inequality (i) follows by repeated application of inequality (32).

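It remains to establish upper bounds of the form (55). By definition, we have $\hat g_j - g_j^* \in \mathrm{Ra}(\Phi^*)$
and $\Phi(\hat g_j - g_j^*) = \hat z_j - \tilde z_j^*$. Recalling the norm $\|a\|_K^2 := a^T K^{-1} a$, we note that the matrices $\widehat Z$ and
$\tilde Z^*$ satisfy the trace smoothness condition $\sum_{j=1}^r \|\hat z_j\|_K^2 = \langle\langle K^{-1}, \widehat Z \widehat Z^T \rangle\rangle \le 2r\rho^2$, and hence

$$\sum_{j=1}^r \|\hat g_j - g_j^*\|_{\mathcal{H}}^2 = \sum_{j=1}^r \|\hat z_j - \tilde z_j^*\|_K^2 \le 2 \sum_{j=1}^r \big( \|\hat z_j\|_K^2 + \|\tilde z_j^*\|_K^2 \big) \le 8 r \rho^2.$$

Furthermore, recalling that $\|f\|_\Phi = \|\Phi f\|_2$, we have

$$\sum_{j=1}^r \|\hat g_j - g_j^*\|_\Phi^2 = \sum_{j=1}^r \|\hat z_j - \tilde z_j^*\|_2^2 = |||\widehat Z - \tilde Z^*|||_{HS}^2 \le |||P_{\widehat Z} - P_{\tilde Z^*}|||_{HS}^2.$$

Consequently, by the bound (56) with $A^2 = |||P_{\widehat Z} - P_{\tilde Z^*}|||_{HS}^2$ and $B^2 = 8r\rho^2$, we conclude that

$$\sum_{j=1}^r \|\hat g_j - g_j^*\|_{L^2}^2 \le c_1 |||P_{\widehat Z} - P_{\tilde Z^*}|||_{HS}^2 + 8 r \rho^2 S_m(\Phi). \qquad (57)$$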



We now need to bound the remaining two terms in the decomposition (54). In order to
do so, we exploit the freedom in choosing the orthonormal families $\{\hat h_j\}_{j=1}^r$ and $\{h_j^*\}_{j=1}^r$. By
appropriate choices, we obtain the following results:

Lemma 12. There exists an orthonormal basis {hj}j =1 of $ for which 

r 

£ [ft -?illi»=^P 4 (58) 

3=1 
Lemma 13. There exists an orthonormal basis $\{h_j^*\}_{j=1}^r$ of $\mathfrak{F}^*$ for which

$$\sum_{j=1}^r \|h_j^* - g_j^*\|_{L^2}^2 \le c_2 r^2 C_m^2(f^*) + 6 r \rho^2 S_m(\Phi). \qquad (59)$$

As these proofs are more technical and lengthy, we defer them to Appendices D.1 and D.2
respectively.

Combining all of the pieces, we obtain the upper bound

$$|||P_{\widehat{\mathfrak{F}}} - P_{\mathfrak{F}^*}|||_{HS}^2 \le c_3 \big\{ |||P_{\widehat Z} - P_{\tilde Z^*}|||_{HS}^2 + r^2 \rho^4 D_m^2(\Phi) + r^2 C_m^2(f^*) + r \rho^2 S_m(\Phi) \big\}. \qquad (60)$$

By using the polarization identity and the decomposition $\mathcal{H} = \mathrm{Ra}(\Phi^*) \oplus \mathrm{Ker}(\Phi)$, one can show that

$$C_m(f^*) \le \kappa_\rho \big( D_m(\Phi) + N_m(\Phi) \big), \qquad (61)$$

when $N_m(\Phi) \le 1$. (See Appendix B.5 for more details.) Using this inequality and noting that
$S_m(\Phi) \ge N_m(\Phi) \ge \{N_m(\Phi)\}^2$ when $N_m(\Phi) \le 1$, the bound (60) can be simplified to the form
given in Theorem 2.



7 Proof of minimax lower bounds 

We now turn to the proofs of the minimax lower bounds stated in Theorems 3 and 4. We begin
with some preliminary results that apply to both proofs.

7.1 Preliminary results 

Our proofs proceed via a standard reduction from estimation to multi-way hypothesis testing
(e.g., [35, 33]). In particular, let $\{f^1, \ldots, f^M\}$ be a $\delta$-packing set of $\mathbb{B}_{\mathcal{H}}(1)$ in a given norm $\|\cdot\|_*$.
(For our proofs, this norm will be either $\|\cdot\|_\Phi$ or $\|\cdot\|_{L^2}$.) Given such a packing set, it is known
that the minimax error in the norm $\|\cdot\|_*$ can be lower bounded, using Fano's inequality, by

$$\inf_{\hat f} \sup_{f^* \in \mathbb{B}_{\mathcal{H}}(1)} \mathbb{P}_{f^*}\Big[ \|\hat f - f^*\|_*^2 \ge \frac{\delta^2}{4} \Big] \ge 1 - \frac{I(y; f) + \log 2}{\log M}, \qquad (62)$$

where $y = (y_1, \ldots, y_n) \in \mathbb{R}^{m \times n}$ is the observation matrix, and $f$ is a random function uniformly
distributed over the packing set. The quantity $I(y; f)$ is the mutual information between $y$ and
$f$, and a key step in the proofs is obtaining good upper bounds on it.

Let $\mathbb{P}_f$ (respectively $\mathbb{P}_g$) be the distribution of $y$ given that $f^* = f$ (respectively $f^* = g$).
The mutual information $I(y; f)$ is intimately related to the Kullback-Leibler (KL) divergence
between $\mathbb{P}_f$ and $\mathbb{P}_g$, which is given by

$$D(\mathbb{P}_f \,\|\, \mathbb{P}_g) = \int p_f(y) \log \frac{p_f(y)}{p_g(y)}\, dy, \qquad (63)$$

where $p_f$ and $p_g$ are the densities with respect to Lebesgue measure. Our analysis requires upper
bounds on this KL divergence, as provided by the following lemma:

Lemma 14. Assume that $\|f\|_\Phi = \|g\|_\Phi$. Then the Kullback-Leibler divergence is upper bounded
as D(F f \\F g ) < n|l/ ; gl1 ^ . 

See Appendix E.1 for the proof.
7.2 Proof of Theorem 3

We are now ready to begin the proof of our lower bounds on the minimax error in the (semi)-norm
$\|\cdot\|_\Phi$. In order to leverage the lower bound (62), we need to have control on the packing
and covering numbers in this norm:

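Lemma 15 (Packing/covering in $\|\cdot\|_\Phi$-norm). Suppose that the kernel matrix $K$ has polynomial-$\alpha$
decay.

(a) Suppose that $m \le (c_0 n)^{\frac{1}{2\alpha}}$ for some constant $c_0$. Then there exists a collection of functions
$\{f^1, \ldots, f^M\}$ contained in $\mathbb{B}_{\mathcal{H}}(1)$ such that $M \ge 4^m$, and

$$\|f^i\|_\Phi^2 = \frac{\sigma_0^2}{16 n} \quad \text{and} \quad \|f^i - f^j\|_\Phi^2 \ge \frac{\sigma_0^2}{64 n} \quad \text{for all } i \ne j \in \{1, 2, \ldots, M\}.$$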



(b) The covering number of the set $\mathrm{Ra}(\Phi^*) \cap \mathbb{B}_{\mathcal{H}}(1)$ in the $\|\cdot\|_\Phi$-norm is upper bounded as

$$\log N_\Phi(\epsilon) \le c_1 (1/\epsilon)^{\frac{1}{\alpha}}. \qquad (64)$$

In the other direction, if e 2 > for some constant k\ > 0, then the packing number is 
lower bounded as 

$$\log M_\Phi(\epsilon) \ge c_2 (1/\epsilon)^{\frac{1}{\alpha}}. \qquad (65)$$

The proof of this auxiliary result is given in Appendix E.2. We now use it to complete the proof
of Theorem 3.

7.2.1 The case of time sampling 

Let us consider part (a) first. Recall that in this case $\sigma_m = \sigma_0/\sqrt{m}$. First, supposing that
$m \le (c_0 n)^{\frac{1}{2\alpha}}$, we establish a lower bound of the order $1/n$ on the minimax risk. (Note that if this upper
bound on $m$ holds, then the $1/n$ term is the minimum of the two terms in Theorem 3(a).) Let
$\{f^1, \ldots, f^M\}$ be the collection of functions from Lemma 15(a). Using the Fano bound (62)
and the inequality $\log M \ge m \log 4$, we obtain

$$\inf_{\hat f} \sup_{f^* \in \mathbb{B}_{\mathcal{H}}(1)} \mathbb{P}_{f^*}\Big[ \|\hat f - f^*\|_\Phi^2 \ge \frac{\sigma_0^2}{256\, n} \Big] \ge 1 - \frac{I(y; f) + \log 2}{m \log 4},$$

where $y$ is the matrix of observations $(y_1, \ldots, y_n) \in \mathbb{R}^{m \times n}$, and the random variable $f$ ranges
uniformly over the packing set $\{f^1, \ldots, f^M\}$. By the convexity of the Kullback-Leibler divergence,
we have

(0 1 ^n\\f-P\\l («) n al m 



I(y;f)<- M r}_^D(¥ fi \\¥ fj ) < _ ^ - 2 < 



i 4n ' 4 ' 



where inequality (i) follows from Lemma 14 and inequality (ii) follows from the packing construction
in Lemma 15(a). Consequently, we have

$$\frac{I(y; f) + \log 2}{m \log 4} \le \frac{m/4 + \log 2}{m \log 4} \le \frac{1}{2}$$

for all $m \ge 2$, which completes the proof.
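The final inequality is elementary and can be checked directly. The short sketch below evaluates the ratio $(m/4 + \log 2)/(m \log 4)$ for a few values of $m$; the function name `fano_ratio` and the chosen values are illustrative.

```python
import math

def fano_ratio(m):
    """Upper bound (m/4 + log 2) / (m log 4) on the ratio appearing in the Fano bound."""
    return (m / 4 + math.log(2)) / (m * math.log(4))

ratios = {m: fano_ratio(m) for m in (2, 3, 10, 100, 10_000)}
```

The ratio is decreasing in $m$ and tends to $1/(4 \log 4)$, so the requirement $m \ge 2$ is what keeps it below $1/2$; for $m = 1$ the ratio exceeds $1/2$.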

Otherwise, we may assume that $m \ge (c_0 n)^{\frac{1}{2\alpha}}$, under which assumption we prove the lower
bound involving the term of order $(mn)^{-\frac{2\alpha}{2\alpha+1}}$. (Note that if this lower bound on $m$ holds, then the
$(mn)^{-\frac{2\alpha}{2\alpha+1}}$ term is the minimum of the two terms in Theorem 3(a).) Let $\delta^2 = c_3 \big( \frac{\sigma_0^2}{mn} \big)^{\frac{2\alpha}{2\alpha+1}}$ for
some C3 > to be chosen. Since m > (c^n)^ by assumption, some algebra shows that 5 2 > -^a, 
so that the lower bound on the packing number from Lemma [ToT b) may be applied. Combining 
this lower bound with the Fano inequality, we obtain 

$$\inf_{\hat f} \sup_{f^* \in \mathbb{B}_{\mathcal{H}}(1)} \mathbb{P}_{f^*}\Big[ \|\hat f - f^*\|_\Phi^2 \ge \frac{\delta^2}{4} \Big] \ge 1 - \frac{I(y; f) + \log 2}{c_2 (1/\delta)^{1/\alpha}}.$$



By the upper bounding technique of Yang and Barron [33], the mutual information I(y;f) is 
upper bounded by inf^o {v 2 + log Nkl(v)}, where ./Vkl is the covering number in the square- 
root Kullback-Leibler (pseudo)-metric. By Lemma [T4l and Lemma [ToTb) . we have A^kl(^) < 
d( JSL- u) l ^ a . Re-parameterizing in terms of e 2 = ^v 2 , we obtain the upper bound 



I(y;f)<ina^e 2 + Cl (l/e)^} < (I) 



l/a 



e>0 Uq 



2 2a 

where e 2 = c^i—^A 2a+1 for some constant C4. Consequently, we have 

= 7(y;/) + log2 (^) 1/Q +log2 
c 2 (l/8) 1 / a ~ (l/5y/ a ' 

Note that $\delta$ and $\epsilon_*$ are of the same order. By choosing the pre-factor $c_3$ sufficiently small, we
can thus guarantee that the ratio R is less than 1/2, from which the claim follows. 



7.2.2 The case of frequency truncation 

1 ^2 2a 

Recall that in this case o m = ao. Since by assumption m > (con) 2a + 1 , letting 5 2 = 2a+1 , 

we have 5 2 > after some algebra. Hence, the lower bound on the packing number from 

Lemma [ToTb) may be applied. Moreover, we have Nkl(v) < 01(7^1/) . The rest of the proof 
follows that of part (a) . 



7.3 Proof of Theorem 4

On one hand, no method can estimate to an accuracy greater than $\frac{1}{2} N_m(\Phi)$. Indeed, whatever
estimator $\hat{f}$ is used, the adversary can always choose some function $f^*$ such that $\Phi(f^*) = 0$
and $\|\hat{f} - f^*\|_{L^2} \ge \frac{1}{2} N_m(\Phi)$. To see this, note that if $\|\hat{f}\|_{L^2} \ge \frac{1}{2} N_m(\Phi)$, then the
adversary can set $f^* = 0$. If instead $\|\hat{f}\|_{L^2} < \frac{1}{2} N_m(\Phi)$, then for any $\delta > 0$ the adversary
can choose a function $f^* \in \operatorname{Ker}(\Phi) \cap B_{\mathcal{H}}(1)$ such that $\|f^*\|_{L^2} \ge N_m(\Phi) - \delta$, by the definition
of $N_m(\Phi)$. We then have $\|f^* - \hat{f}\|_{L^2} \ge \|f^*\|_{L^2} - \|\hat{f}\|_{L^2} \ge \frac{1}{2} N_m(\Phi) - \delta$, and we let $\delta \to 0$. In
addition, it follows from the theory of optimal widths in Hilbert spaces [26] that $N_m(\Phi) \gtrsim \mu_{m+1}$,
thereby establishing the $m^{-2\alpha}$ lower bound for a kernel operator with polynomial-$\alpha$ decay.

Let us now prove the lower bound involving $(mn)^{-\frac{2\alpha}{2\alpha+1}}$ in part (a). This term is the smaller of
the two terms involved in the minimum when $m \ge n^{\frac{1}{2\alpha}}$; this is the only case we need to consider,
since for $m < n^{\frac{1}{2\alpha}}$ the minimum is $n^{-1}$, which is dominated by the term $m^{-2\alpha}$. We introduce
the shorthand $\Psi_1^m := \operatorname{span}\{\psi_1, \ldots, \psi_m\} \cap B_{\mathcal{H}}(1)$, corresponding to the intersection of the unit
ball $B_{\mathcal{H}}(1)$ with the $m$-dimensional subspace of $\mathcal{H}$ spanned by the first $m$ eigenfunctions of the
kernel. For this proof, our packing/covering constructions take place entirely within this set.
The following lemma, proved in Appendix E.3, provides bounds on these packing and covering
numbers:

Lemma 16 (Packing/covering in $\|\cdot\|_{L^2}$-norm). There is a universal constant $c_1 > 0$ such that

$$\log N_{L^2}(\epsilon; \Psi_1^m) \le c_1 (1/\epsilon)^{1/\alpha}. \qquad (66)$$

In the other direction, if $\epsilon^2 \ge \kappa_1 m^{-2\alpha}$ for some constant $\kappa_1 > 0$, there is a universal constant $c_2 > 0$
such that

$$\log M_{L^2}(\epsilon; \Psi_1^m) \ge c_2 (1/\epsilon)^{1/\alpha}. \qquad (67)$$

2a 

Based on this lemma, proving a (mn) 2a + 1 bound is relatively straightforward, once again 

a 2 2a 

using Fano's inequality (|62|) . Choosing 5 = C3(^) 2a+1 for a constant C3 to be specified, we 
construct a $\delta$-packing in the $\|\cdot\|_{L^2}$ norm, of size $M$ such that $\log M \ge c_2 (1/\delta)^{1/\alpha}$. As in the proof
of Theorem 3, we upper bound the mutual information in terms of the covering number in the
$\|\cdot\|_\Phi$ norm. By condition (B2), this covering number is upper bounded (up to constant factors) by
the covering number in the $\|\cdot\|_{L^2}$-norm. To see this, note that for any $f \in \Psi_1^m \cap B_{L^2}(\epsilon)$, we
have $f = \sum_{j=1}^m a_j \psi_j$ with $\sum_{j=1}^m a_j^2/\mu_j \le 1$ and $\sum_{j=1}^m a_j^2 \le \epsilon^2$. Then, condition (B2) implies
$\|f\|_\Phi^2 = \langle a, \Psi a \rangle \le c_1 \|a\|_2^2 \le c_1 \epsilon^2$, that is, $f \in \Psi_1^m \cap B_\Phi(\sqrt{c_1}\,\epsilon)$. Finally, by Lemma 16, the $\|\cdot\|_{L^2}$
covering number scales as $(1/\epsilon)^{1/\alpha}$, so that the same calculations as before yield the $(mn)^{-\frac{2\alpha}{2\alpha+1}}$
rate as claimed.

The proof of part (b) is similar. We only need to consider the case $m \ge n^{\frac{1}{2\alpha+1}}$. The rest of
the argument follows by taking 5 2 = 63(^7-) 2a+1 and recalling that a m = o~o in this case. 



8 Discussion 



We studied the problem of sampling for functional PCA from a functional-theoretic viewpoint.
The principal components were assumed to lie in some Hilbert subspace $\mathcal{H}$ of $L^2$, usually an RKHS,
and the sampling operator was taken to be a bounded linear map $\Phi : \mathcal{H} \to \mathbb{R}^m$. The observation model was
taken to be the output of $\Phi$ plus some Gaussian noise. The two main examples of $\Phi$ considered
were time sampling, $[\Phi f]_j = f(t_j)$, and (generalized) frequency truncation, $[\Phi f]_j = \langle \psi_j, f \rangle_{L^2}$.
We showed that it is possible to recover the subspace spanned by the original components by
applying a regularized version of PCA in $\mathbb{R}^m$ followed by a simple linear mapping back to function
space. The regularization involved the "trace-smoothness condition" (18) based on the matrix
$K = \Phi\Phi^*$, whose eigendecay influenced the rate of convergence in $\mathbb{R}^m$.

We obtained the rates of convergence for the subspace estimators both in the discrete domain,
$\mathbb{R}^m$, and the function domain, $L^2$. As examples, for the case of an RKHS $\mathcal{H}$ for which both the
kernel integral operator and the kernel matrix $K$ have polynomial-$\alpha$ eigendecay (i.e., $\mu_j \asymp \hat{\mu}_j \asymp
j^{-2\alpha}$), the following rates in HS-projection distance for subspaces in the function domain were
worked out in detail:

    time sampling:          $\left(\frac{1}{mn}\right)^{\frac{2\alpha}{2\alpha+1}} + \left(\frac{1}{m}\right)^{2\alpha}$
    frequency truncation:   $\left(\frac{1}{n}\right)^{\frac{2\alpha}{2\alpha+1}} + \left(\frac{1}{m}\right)^{2\alpha}$
The two terms in each rate can be associated, respectively, with the estimation error (due to
noise) and approximation error (due to having finite samples of an infinite dimensional object).
Both rates exhibit a trade-off between the number of statistical samples ($n$) and that of functional
samples ($m$). The two rates are qualitatively different: the two terms in the time sampling case
interact to give an overall fast rate of $n^{-1}$ for the optimal trade-off $m \asymp n^{\frac{1}{2\alpha}}$, while there is no
interaction between the two terms in the frequency truncation case; the optimal trade-off gives an
overall rate of $n^{-\frac{2\alpha}{2\alpha+1}}$, a characteristic of nonparametric problems. Finally, these rates were
shown to be minimax optimal.
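To make the two trade-offs concrete, the following sketch evaluates both rates with all constants dropped (the rates hold only up to constant factors). The function names, the choice $\alpha = 1$, $n = 10^4$, and the use of a real-valued $m$ are illustrative assumptions on our part.

```python
def time_sampling_rate(n: float, m: float, alpha: float) -> float:
    """(1/(mn))^{2a/(2a+1)} + (1/m)^{2a}, with constants dropped."""
    e = 2 * alpha / (2 * alpha + 1)
    return (1.0 / (m * n)) ** e + (1.0 / m) ** (2 * alpha)


def frequency_truncation_rate(n: float, m: float, alpha: float) -> float:
    """(1/n)^{2a/(2a+1)} + (1/m)^{2a}, with constants dropped."""
    e = 2 * alpha / (2 * alpha + 1)
    return (1.0 / n) ** e + (1.0 / m) ** (2 * alpha)


alpha, n = 1.0, 10_000

# Choice of m that balances the two terms in each rate.
m_time = n ** (1 / (2 * alpha))        # m ~ n^{1/(2 alpha)}
m_freq = n ** (1 / (2 * alpha + 1))    # m ~ n^{1/(2 alpha + 1)}

rate_time = time_sampling_rate(n, m_time, alpha)          # equals 2 / n
rate_freq = frequency_truncation_rate(n, m_freq, alpha)   # equals 2 n^{-2a/(2a+1)}
```

With the balancing choice of $m$, both terms of the time sampling rate equal $n^{-1}$, whereas both terms of the frequency truncation rate equal $n^{-2\alpha/(2\alpha+1)}$, which is the slower, nonparametric rate.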
A A special kernel

In this appendix, we examine a simple reproducing kernel Hilbert space, corresponding to a 
Sobolev or spline class with smoothness a = 1. We provide expressions for various approximation- 
theoretic quantities appearing in our results, such as -D m (<3>), N m (&) and Further background 
on the calculations given here can be found in the paper [2]. 

Let us consider the time sampling model (8) with uniformly spaced points $t_j = j/m$ for
$j = 1, \ldots, m$. Elementary calculations show that $K = (m^{-1} \mathbb{K}(t_i, t_j)) = \frac{1}{m^2} L L^T$, where $L \in
\mathbb{R}^{m \times m}$ is lower triangular with all the nonzero entries equal to 1. It can be shown that the eigenvalues
of $K$ are given by $\hat{\mu}_k := \left\{4 m^2 \sin^2\!\left(\frac{(k - 1/2)\pi}{2m+1}\right)\right\}^{-1}$ for $k = 1, 2, \ldots, m$. Using the inequalities $\frac{2}{\pi} x \le
\sin(x) \le x$, for $0 \le x \le \pi/2$, we have

/2m + l\ 2 _ 7r 2 /2m + l\2 

showing that is a good approximation of even for moderate values of m. 
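The eigenvalue formula can be checked numerically. The sketch below builds $K = m^{-2} L L^T$ (whose entries are $\min(i,j)/m^2$) and verifies that each claimed eigenvalue is attained. The eigenvector ansatz $v_j = \sin\big(j(2k-1)\pi/(2m+1)\big)$ is a standard fact about this matrix that we assume here; it is not stated in the text, and all function names are illustrative.

```python
import math


def kernel_matrix(m: int) -> list:
    """K = (1/m^2) L L^T, i.e. K[i][j] = min(i, j) / m^2 (1-based indices)."""
    return [[min(i, j) / m**2 for j in range(1, m + 1)] for i in range(1, m + 1)]


def eigenvalue(k: int, m: int) -> float:
    """Claimed k-th eigenvalue {4 m^2 sin^2((k - 1/2) pi / (2m + 1))}^{-1}."""
    return 1.0 / (4 * m**2 * math.sin((k - 0.5) * math.pi / (2 * m + 1)) ** 2)


def eigenvector(k: int, m: int) -> list:
    """Assumed eigenvector with entries sin(j (2k - 1) pi / (2m + 1))."""
    return [math.sin(j * (2 * k - 1) * math.pi / (2 * m + 1)) for j in range(1, m + 1)]


def residual(k: int, m: int) -> float:
    """Max-norm of K v - lambda v for the k-th claimed eigenpair."""
    K, v, lam = kernel_matrix(m), eigenvector(k, m), eigenvalue(k, m)
    Kv = [sum(K[i][j] * v[j] for j in range(m)) for i in range(m)]
    return max(abs(Kv[i] - lam * v[i]) for i in range(m))


m = 8
max_residual = max(residual(k, m) for k in range(1, m + 1))
eig_sum = sum(eigenvalue(k, m) for k in range(1, m + 1))
trace = sum(kernel_matrix(m)[i][i] for i in range(m))
```

As an additional consistency check, the claimed eigenvalues sum to the trace of $K$, which equals $(m+1)/(2m)$.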

Recalling the definition of ^ G jg>mxm f rom Section T4.21 it can be shown that it takes the form 
^ = l m + ^I a lJ, where h G K m is the vector with entries [I s ]j = Since A max (^) = 2, 

condition (B2) is clearly satisfied. 

Now we consider the quantity D m {Q); by Lemma [21 it suffices to bound the operator norm 
of K- 1 / 2 ^ 2 - G)^- 1 / 2 . Some algebra shows that K 2 - 9 = mT^{\hh T + \m 2 K), where 
h = (1, 2, . . . , m), so that 

D m ($) = \}K-V 2 (K 2 -e)K-V% = * h T K-'h+ = J-+ * < I. 

Finally, it can be shown that A r m (<I>) ^ -^jr ; see the paper [2] for details. 

B Auxiliary lemmas 

Here we collect the proofs of various auxiliary lemmas. 
B.1 Proof of Lemma 1

The space $\operatorname{Ra}(\Phi^*)$ is finite-dimensional and hence closed, which guarantees validity of the well-known
decomposition $\mathcal{H} = \operatorname{Ra}(\Phi^*) \oplus \operatorname{Ker}(\Phi)$. In particular, for any $f \in \mathcal{H}$, there is $a \in \mathbb{R}^m$ and
$f^\perp \in \operatorname{Ker}(\Phi)$ such that $f = \Phi^* a + f^\perp$. Then, $\Phi f = Ka$, and

$$\|f\|_{\mathcal{H}}^2 \ge \|\Phi^* a\|_{\mathcal{H}}^2 = \langle \Phi^* a, \Phi^* a \rangle_{\mathcal{H}} = \langle a, \Phi\Phi^* a \rangle_{\mathbb{R}^m} = \langle Ka, Ka \rangle_K = \|\Phi f\|_K^2.$$

Equality holds iff $f^\perp = 0$, which gives the desired condition.
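A finite-dimensional analogue illustrates the inequality. In the sketch below we assume, purely for illustration, that $\mathcal{H} = \mathbb{R}^3$ with the Euclidean inner product and that $\Phi$ is a $2 \times 3$ matrix, so that $K = \Phi\Phi^T$ and $\|z\|_K^2 = z^T K^{-1} z$. The names `gram` and `sampled_norm_sq` are hypothetical helpers.

```python
def gram(phi: list) -> list:
    """K = Phi Phi^T for a matrix Phi given as a list of rows."""
    return [[sum(x * y for x, y in zip(r, s)) for s in phi] for r in phi]


def sampled_norm_sq(phi: list, f: list) -> float:
    """||Phi f||_K^2 = (Phi f)^T K^{-1} (Phi f), for a Phi with two rows."""
    z = [sum(x * y for x, y in zip(row, f)) for row in phi]
    (a, b), (c, d) = gram(phi)
    det = a * d - b * c
    # Apply the explicit 2x2 inverse of K to z.
    w = [(d * z[0] - b * z[1]) / det, (-c * z[0] + a * z[1]) / det]
    return z[0] * w[0] + z[1] * w[1]


phi = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

# f_in lies in Ra(Phi^T): it equals 1 * row_1 + 2 * row_2, so equality holds.
f_in = [1.0, 2.0, 3.0]
# f_out has a component in Ker(Phi), so the inequality is strict.
f_out = [1.0, 0.0, 0.0]

norm_in = sum(x * x for x in f_in)
norm_out = sum(x * x for x in f_out)
k_norm_in = sampled_norm_sq(phi, f_in)
k_norm_out = sampled_norm_sq(phi, f_out)
```

In this toy setting $\|\Phi f\|_K^2$ is the squared norm of the projection of $f$ onto the row space of $\Phi$, which mirrors the role of $\operatorname{Ra}(\Phi^*)$ in the proof.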

B.2 Proof of Lemma 2

By a well-known result, for a symmetric matrix, the numerical radius is equal to the operator
norm. Thus, we have

$$|\!|\!| K - K^{-1/2} \Theta K^{-1/2} |\!|\!|_2 = \sup_{a \in \mathbb{R}^m \setminus \{0\}} \frac{|a^T (K - K^{-1/2} \Theta K^{-1/2}) a|}{\|a\|_2^2}.$$

Making the substitution $b = K^{-1/2} a$, or equivalently $a = K^{1/2} b$, we obtain

$$|\!|\!| K - K^{-1/2} \Theta K^{-1/2} |\!|\!|_2 = \sup_{b \in \mathbb{R}^m \setminus \{0\}} \frac{|b^T (K^2 - \Theta) b|}{b^T K b}.$$

Now define the function $f = \Phi^* b \in \operatorname{Ra}(\Phi^*)$. With this definition, we have the following equivalences:

$$b^T K b = \|\Phi^* b\|_{\mathcal{H}}^2 = \|f\|_{\mathcal{H}}^2, \qquad b^T K^2 b = \|\Phi f\|_2^2 = \|f\|_\Phi^2, \qquad \text{and} \qquad b^T \Theta b = \Big\| \sum_{i=1}^m b_i \varphi_i \Big\|_{L^2}^2 = \|f\|_{L^2}^2,$$

from which the claim follows.



B.3 Proof of Lemma [3] 

The (truncated) QR decomposition [T3] of Z* has the form Z* = Z*R, where Z* G K(K m ), 
and -R G ]R rXT ' is upper triangular with nonnegative diagonal entries. By construction, we have 
Ra(Z*) = Ra(Z*). Moreover, from the trace smoothness condition (|18p . we have 

r 

rp 2 > ^\\z*\\ 2 K = tr {{Z*) T K- l Z*) > X min (R T R)tv ((Z* f K^Z*) (68) 

3=1 

where the final inequality follows from the bound (|90p in Appendix |TJ Recalling the defini- 
tion ([251), w e have C m {f*) = \\{Z*) T Z* -l^ = \\R T R-I^. Since Xj(R T R) = X j (R T R-I r ) + 1, 
we have 

max \Xj(R T R) - 1| < \\R T R - I r \\ 2 < r\\R T R - 1^ = rC m {f). (69) 

3=1,— ,r 

Since rC m (f*) < ^, we conclude that X m i n (R T R) > |. Combined with our earlier bound 
we conclude that Z* indeed satisfies the trace-smoothness condition. 
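B.4 Proof of Lemma 5

We only need to consider the case $\nu = 1$; the general case follows by rescaling. Consider the
following local one-sided version of $D_m(\Phi)$,

$$U_{\mathrm{loc}}(\epsilon; \Phi) := \sup \big\{ \|f\|_{L^2}^2 \;:\; f \in \operatorname{Ra}(\Phi^*),\ \|f\|_{\mathcal{H}} \le 1,\ \|f\|_\Phi \le \epsilon \big\}. \qquad (70)$$

Using an argument similar to that of Lemma 2, (70) is equivalent to

$$U_{\mathrm{loc}}(\epsilon; \Phi) = \sup \big\{ b^T K^{-1/2} \Theta K^{-1/2} b \;:\; b^T b \le 1,\ b^T K b \le \epsilon^2 \big\}. \qquad (71)$$

Using Lagrange duality, we have

$$U_{\mathrm{loc}}(\epsilon; \Phi) \le \inf_{t \ge 0} \big[ \max\big( \lambda_{\max}(K^{-1/2} \Theta K^{-1/2} - tK),\, 0 \big) + t \epsilon^2 \big] \le c_0\, \epsilon^2 \qquad (72)$$

since (B1) implies $\lambda_{\max}(K^{-1/2} \Theta K^{-1/2} - c_0 K) \le 0$.

For $f \in \mathcal{H}$, let $f = g + f^\perp$ be its decomposition according to $\mathcal{H} = \operatorname{Ra}(\Phi^*) \oplus \operatorname{Ker}(\Phi)$. Then,
$\|g\|_{\mathcal{H}}^2 + \|f^\perp\|_{\mathcal{H}}^2 = \|f\|_{\mathcal{H}}^2 \le 1$ and $\|f\|_{L^2}^2 \le 2\|g\|_{L^2}^2 + 2\|f^\perp\|_{L^2}^2$. Hence, we obtain

$$R_m(\epsilon; 1) \le 2\, U_{\mathrm{loc}}(\epsilon; \Phi) + 2 N_m(\Phi). \qquad (73)$$

Combining (72) and (73) proves the claim.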



B.5 Proof of inequality (16 lj) 

By polarization identity and some algebra, 

C m (/*)<2p 2 sup HI/HI -\\f\\ 2 L2 \ 

II/II«<1, 11/11^ = 7^ 

Let / = 5 + /- 1 be the decomposition according to f e H = Ra(<&*) + Ker($). Let / G B^(l) 
and = ^=-. Then, as in Appendix IB. 4( we have G B-^(l). Hence, 

2 

4- " 




< Dm($) 



where we have define a,b > as above for simplicity. Let d := ||/ ||x,a. By triangle inequality, 
b < a + d and |a — b\ < d. Then, 

\a 2 -b 2 \ = |a-6|(a + 6) < d(2a + d) < + l)jV OT ($), 
since a = and d < iV m (<3?) < 1, by assumption. 
C Proofs for Theorem 1

In this appendix, we collect the proofs of various auxiliary lemmas involved in the proof of
Theorem 1.

C.l Derivation of the bound (EH) 



From the CS-decomposition (144j) . we have Z 1 Z* = C, and hence PZ* = ZC — Z* = EC — 
Z*(I r — C). From the decomposition ()43|) . we have 

((P, Ai)) = atr \WSR T (Z*) T P + Z*RSW T P] 
= 2a tr [RSW T PZ*] 

= 2a{ tr [RSW T EC] - tr [RSW T Z*(I r -C)]\, 

where we have used the standard facts tx{AB T ) = tv(A T B) and tr (AB) = tr(BA). For the first 
term we have 



tr [RSW T EC] 



j,k=i 



RjkSkiwk^cj < ( R lk) (J24c 2 j ((wk,ej)) 



j,k=l j,k 



where we have used Cauchy-Schwarz. By ([69]) . under the assumption rC m (f*) < I, we have 
ti(R T R) < |r. We also have < Sk < si = 1 and < Cj < 1 for j, k = 1, . . . , r. It follows that 

tr [RSW T EC]\ < y|vF( £ ((^fe,^)) 2 ) 172 < y^r^maxlH,^-)!. 

For the second term, using a similar argument by applying Cauchy-Schwarz, we get 

rr /j^ \ i/2 



— 7" 7" 



i=i fc=i 

- yf r (X]( 1 ~^) 2 ) 1 max|(TZJ fc ,i*)| < ^ r \\E\f HS max \(Wj,T k )\. 

where the last inequality follows from the fact that — cj,) 2 ) 1 ^ 2 < ^^(1 — Cj) = ^WElf^g. 

C.2 Proof of Lemma [8] 

We make use of an ellipsoid approximation (see [23]). To simplify notation, define K := (8rp 2 )K 
and /I := 8rp 2 Jl, so that we have ti(Z T K~ 1 Z) < 2rp 2 if and only if tr(Z T K~ 1 Z) < 1/4. Since 
both Z and Z* satisfy this condition, it follows that ||%||^ < \ and < | for j = 1, . . . ,r, 

where ||a|| := a T K~ 1 a. Thus, we are guaranteed that ej G £^ '■= {v G K m | \\v\\g ^ l}- 

We first establish an upper bound on the quantity sup { (Wk, v) \ v G £^ D B 2 (t)}, where 
®2(i) = G M m | ||v||2 < i} is the Euclidean ball of radius t. Let jui > ••• > Jl m be the 
eigenvalues of K in decreasing order and let Jl := (fii, . . . ,Jl m ). Since for ?7 G V m (W m ), the 
random vectors TZJfc and UWk have the same distribution, it is equivalent to bound the quantity 
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sup{(w k ,v) I v G ^nl 2 (t)}. Now for v G ^nl 2 (t), we have XXi^~V < 1 and 
^2^Lit~ 2 vf < 1 implying Y^=i maxjpt^ 1 , t^ 2 }^ 2 < 2. Consequently, if we define the modified 

2 

ellipse £ 7 := {v e M. m \ YT=i ^ < l} wh ere 7i := 2 min{i 2 , ft}, then we are guaranteed that 
v £ £ 7 , so that it suffices to upper bound sup„ gf ^ (Wk,v). For future reference, we note that 

IMIi = WF 2 (t/V8), and ||7||oo<2t 2 (74) 

where T was defined previously ([24"]) . Define the random variables Wk '■= ^YHi=i w ifiik and 
-Bfcfc := ^ SILi Atc- F° r eacn index k, Lemma PT71 (see Appendix |F|) . combined with the rela- 
tions ([71]), yields 



a sup \(W k ,v)\ <aB 1 £{J^ + 8 < C 1 ^ / fc 2 ^{J-(t/^) + 5t} 

with probability at least 1 — exp(— 5 2 /2). Taking 5 = A ry /nt/a, where A r := Kr -3 / 2 for some 
small enough constant k > 0, we obtain 



a sup |(Sj fc ,u)| < Ci^fl^JWVs) + A r t 2 } 
ue£ 7 L v n J 



with probability at least 1 — exp (-A 2 r nt 2 /2a 2 ). 

As was mentioned earlier, the same bound with the same probability holds for sup {cr| (lIJk,v) \ 
v E £j^nB2(t)} . Since ej G j = 1. .... r we can apply the technical Lemmal20lof AppendixlHl 
with v = (n, m), 0^ = A r n/a 2 and = e m ^ n to obtain 

ffKWfc.ej)] < Ci^ 2 {^.F(2||e j || 2 /v / 8) + A(2||e J || 2 ) 2 + -£=.F(2e m , n /\/8) + A(2e m , n ) 2 }, 



for all j G {1, . . . , r}, with probability at least 1 — c\ exp(— A 2 n n /2a 2 ). Note that \\ej || 2 < 
HI HI J pi"^ , j = 1, . . . ,r. Since the bound obtained above is nondecreasing in ||e^-||, we can replace 
||ej|| everywhere with |||-E|||.h-s. We also note that by Xn concentration [13 US], we have Bkk < 3/2 
with probability at least 1 — exp(— n/64). Finally, by definition of e m>n and monotonicity of T 
we have -J- F{2e m , n /^) < -^F{ £m,n) ^ -^r^rnn- Putting together the pieces, we conclude that 

m^a\{w k ,e j )\<C 2 {^ r F{iE\\ H s) + A r ||| E\f HS + A r e 2 }, 
j,k L yjn ) 

with probability at least 1 — c\r exp(— A 2 n e 2 ^ n /2a 2 ) — r exp(— n/64), where we have used union 
bound to obtain a uniform result over k. 

C.3 Proof of Lemma [9] 

We control terms of the form (Wk, Zj) using Lemma[T7]in AppendixIFJ this time with Wi replaced 

with (wi, Zj) and 7 = 1 (i.e., we are looking at sums of products of univariate Gaussians). Thus, 

1/2 

for any fixed j and k, we have cr\(wk, z*)\ < a B kk |-y= + 5-^}, with probability at least 

1 -exp(-5 2 /2). Taking 5 = Kr ^^fnjo, then the event max/% B k k < 3/2, which we have already 
accounted for, we have by union bound 



max ar\ {Wk, Zj 

with probability at least 1 — r 2 exp(—K 2 r~ 2 n/2a 2 ). The second inequality follows by our assump- 
tion r < K\/n/a. 
C.4 Proof of Lemma 10

For each j £ {1, ...,r}, we define the vector Q 3 := Wz* £ W 1 so that ( 3 = (C/) where 
Cf = w T^j- We can use the same ellipsoid approximation as Appendix IC.2I — that is, we 
first look at sup {Ti(u; z*) | u G £^ n B 2 (t)} and then argue that it is enough to bound 

sup^ g ^ Ti(v,Zj) = swp ve£ (v, ^Y17=iCi w i ~^*j)-> ^ ue *° the invariance of the underlying dis- 
tribution under orthogonal transformations of v. Now applying Lemma [18] from Appendix IPl 
yields 



a 2 supTifa^) < (^ + l){4lv^Mk + ^ 2 ^MU) (75) 



•lie 



with probability at least 1 — 2 exp(— Recalling that by assumption a < oq, let A r = kt 1 . 
For i < <7(To/^4 r , take J = ^4 r t/(cr<7o) < 1- Then, using ([73]) . the left-hand side of ([75]) is bounded 
above by 

d (M 2 - + l) {<r ^= .F(t/v/8) + 2 r i 2 } (76) 



with probability at least 1 - 2exp ( — A 2 n t 2 / (16 a 2 ctq)) . For t > aa /A r , take 5 = A r t/a 2 . In 
this case, A r t > aao > a 2 implying 6 > 1. Then, the left-hand side of ([75]) is again bounded 
above by (|76|) . this time with probability at least 1 — 2exp(— A r nt/(16cr 2 )). Assuming k < 1, 
which is going to be the case, we have A 2 < ^4 r . Combining the two cases, we have the upper 
bound ([76]) with probability at least 

1 - 2exp{ - A 2 n(a 2 A 1) (t A i 2 ) / (16c 2 )} (77) 

* v ' 

Pl(t) 

for all t > 0. (Note the break-up into two cases was to obtain a dependence of o~ 2 in the 
probability exponent for all t > 0.) 

By an argument similar to Appendix IC.21 — that is, using technical Lemma [20] — we have 
||e?||2 < l-^ I and -^J r (2e mjn /v / 8) < 2 r e^ n from the definition; we obtain 

(T^si*) < C 2 (^ + l){a ^(p| ff5 ) + 2,1^11^ + 2 r 4 n } 
v» V n 

for all j G { 1 , . . . , r} , with probability at least that of ([77]) with i = e mj „ and 2 replaced with 
some constant c 2 > 2, i.e. 1 — C2Pi{e m . n ). By concentration of x 2 variables and union bound, 
we have maxy n -1 ||C J ||| < 3/2 with probability at least 1 — rexp(— n/64). Putting together the 
pieces, we conclude that 

r 

a 2 T^-z*) < cAao^rFdEWus) + n\\Ef HS + K e 2 m A 

3=1 Vn 

with probability at least 1 — C2Pi(e mjn ) — r exp(— n/64), as claimed. 






C.5 Proof of Lemma 11

As before, the problem of bounding ^(e^-) can be reduced to controlling swp ve£ T%(v), by 
invariance under orthogonal transformation. Applying Lemma [19] of Appendix [G] with with 
5 = Ky/nt/a yields 



a sup 



y/TAy)= SU p ^||T4^|| 2 < jyEk + (l + K-)^MU 



< c , i{-^=J'(t/\/8) + (rt + Kt 2 } 

< Ci{-^=J'(t) + at + v^k*} 



with probability at least 1 — exp(— K 2 nt 2 /2a 2 ), valid for all t < v2, Note that since 1 1 e^- 1 1 2 < 
(by proper alignment), it is enough to only have a bound for t < \/2- Recall the assumption 
(Al), a < yfTi and by (A3), ^ J 7 ^) < y'Tci. Assuming k < 1, we obtain 

a 2 sup T 2 (v) < C\{2^J~kt + V2~Kt) 2 < C 2 nt 2 

»££y 

with the same probability. As before, applying technical Lemma [20l this time with t„ = 

?" _1 ^ 2 em,n, we obtain 

a 2 T 2 (e 3 ) < C 2 K{(2||e,|| 2 ) 2 +(2^) 2 } 
for all j € {1, . . . , m} with probability at least 1 — C3 exp(— «; 2 r _2 ne 2 n n /2cr 2 ). Thus, we have 

r 

i=i 

with probability the same probability. Note that we have used |||-E'||| 2 ^5 = J2j ll^illi- 

D Proofs for Theorem 2

In this appendix, we collect the proofs of various auxiliary lemmas involved in the proof of
Theorem 2.

D.1 Proof of Lemma 12

By definition, each hj lies in J, so that we have hj = ^* (^2 li BijK^ l Zi) for some B € R rX7 \ 
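Recalling that $[K^{-1} Z B]_j$ denotes the $j$-th column of $K^{-1} Z B$, we can write $h_j = \Phi^* [K^{-1} Z B]_j$.
Recalling the formula for the adjoint, observe that for any $a, b \in \mathbb{R}^m$, we have

$$\langle \Phi^* a, \Phi^* b \rangle_{L^2} = \Big\langle \sum_i a_i \varphi_i, \sum_j b_j \varphi_j \Big\rangle_{L^2} = \sum_{i,j} a_i b_j \langle \varphi_i, \varphi_j \rangle_{L^2} = a^T \Theta b \qquad (78)$$

where $\Theta = (\langle \varphi_i, \varphi_j \rangle_{L^2}) \in \mathbb{R}^{m \times m}$ as previously defined in Lemma 2. Since the functions $\{h_j\}_{j=1}^r$ are
orthonormal in $L^2$, we must have $\langle h_j, h_k \rangle_{L^2} = [K^{-1} Z B]_j^T\, \Theta\, [K^{-1} Z B]_k = \delta_{jk}$, or in matrix form
$(K^{-1} Z B)^T \Theta (K^{-1} Z B) = I_{r \times r}$. This condition can be re-written as

$$B^T Q B = I_{r \times r}, \qquad \text{where } Q := Z^T K^{-1} \Theta K^{-1} Z.$$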
Since $h_j - g_j = \Phi^* [K^{-1} Z (B - I_r)]_j$, we have $\|h_j - g_j\|_{L^2}^2 = [K^{-1} Z (B - I_r)]_j^T\, \Theta\, [K^{-1} Z (B - I_r)]_j$,
using the definition of $\Theta$. Consequently, we obtain

$$\sum_{j=1}^r \|h_j - g_j\|_{L^2}^2 = \operatorname{tr}\big\{ (K^{-1} Z (B - I))^T \Theta (K^{-1} Z (B - I)) \big\} = \operatorname{tr}\{ I + Q - 2QB \},$$

using the symmetry of $Q$ and the constraint $B^T Q B = I$. Subject to this constraint, we are free
to choose $B$ as we please; setting $B = Q^{-1/2}$ yields

$$\sum_{j=1}^r \|h_j - g_j\|_{L^2}^2 = \operatorname{tr}\big\{ (I - Q^{1/2})^2 \big\} = |\!|\!| I - Q^{1/2} |\!|\!|_{HS}^2.$$

3=1 



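In order to upper bound $|\!|\!| I - Q^{1/2} |\!|\!|_{HS}$, we first control the closely related quantity $|\!|\!| I - Q |\!|\!|_{HS}$.
We have

$$|\!|\!| I_r - Q |\!|\!|_{HS} = |\!|\!| Z^T K^{-1/2} (K - K^{-1/2} \Theta K^{-1/2}) K^{-1/2} Z |\!|\!|_{HS} \le |\!|\!| K - K^{-1/2} \Theta K^{-1/2} |\!|\!|_2 \, |\!|\!| Z^T K^{-1} Z |\!|\!|_{HS} \le 2 r \rho^2 D_m(\Phi) \qquad (79)$$

where we have used inequality (93), Lemma 2, the trace-smoothness condition $\operatorname{tr}(Z^T K^{-1} Z) \le 2 r \rho^2$,
and the inequality $|\!|\!| M |\!|\!|_{HS} \le \operatorname{tr}(M)$, valid for any $M \succeq 0$.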
In order to bound $|\!|\!| I - Q^{1/2} |\!|\!|_{HS}$, we apply the inequality

$$|\!|\!| A^q - D^q |\!|\!| \le q\, a^{q-1}\, |\!|\!| A - D |\!|\!|, \qquad 0 < q < 1, \qquad (80)$$

valid for any operators $A$, $D$ such that $A \succeq aI$ and $D \succeq aI$ for some positive number $a$,
where $|\!|\!| \cdot |\!|\!|$ is any unitarily invariant norm. (See Bhatia [5], equation (X.46) on p. 305.) As
long as $2 r \rho^2 D_m(\Phi) \le 1/2$, so that the bound (79) implies that $Q \succeq \frac{1}{2} I$, we may apply the
inequality (80) with $A = I_r$, $D = Q$, $a = q = 1/2$ and $|\!|\!| \cdot |\!|\!| = |\!|\!| \cdot |\!|\!|_{HS}$ so as to obtain the inequality
$|\!|\!| I_r - Q^{1/2} |\!|\!|_{HS} \le \frac{1}{\sqrt{2}} |\!|\!| I_r - Q |\!|\!|_{HS}$, which completes the proof.

D.2 Proof of Lemma 13

By definition, we have h* = J2 r i=i E ijfi f° r some E G W xr . Since both {h*} and {/*} 
are assumed orthonormal in L 2 , the matrix E must be orthonormal. In addition, we have 
£J=i \\h*\\ 2 n = £ -=i \\f*\\ 2 n < rp 2 , implying that 

r r 

E 11*3 - 9j\\n < 2 E (n^iiw + W\h) ^ 2 (^ 2 + 2r p 2 ) ^ 6r p 2 

3=1 3=1 

where we have used the fact that Ylj=i WdjWn = £j=i II^IIk — %rp 2 . 

Recall the argument leading to the bound (|56|k applying this same reasoning to the pair 
(h*,g*) with the choices A 2 = YJj=\ \\K ~ g)\\% and B 2 = 6rp 2 leads to 

r r 

E W h i ~ 9jWh <ciE W h *3 - 9j\\l + Qrp 2 S m (t>). 

3=1 3=1 
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It remains to bound the term 5^5=1 W^j ~ 9j\\%- Recalling that <&/* = z* , we note that 
®h* = Ej E ijZ * = [Z*E]j. It follows that ££ =1 [|ft--fl£||| = \\Z*E-Z*\\% 3 . Since Ra(Z*) = Ra(Z*), 
there exists a matrix i? G M rxr such that Z* = Z*R. Letting ViTlf denote the SVD of R, we 
have 

— Z*\\hs = ll-R-S — IrWns = ll-R _ ^ T ||i?5 = \\Vl — E T \\ HS , 

where we have used the unitary invariance of the Hilbert-Schmidt norm. TakeE T = V x Vj which 
is orthogonal, hence a valid choice. By unitary invariance, we have \\Z*E — Z*\\hs = ||T — I v \\hs- 
We now apply inequality (|BU|) with a = q = 1/2, A = Y 2 , D = I r , ||| • ||| = ||| • \\hs- The condition 
r C m {f*) < \ implies T 2 >z \l r . (See Appendix IB. 31 in particular the argument following (f69l) .) 



Consequently, we have ||T — I r \\HS < 1 1 TC 2 — I r \\HS < -^\\Z* T Z* — I r \\HS, where we have used 

y 2 T 2 y 2 T = R T R = Z* T Z*. Recalling that \\Z* T Z* - I r \\ HS < rC m (f*) and putting together the 
pieces, we obtain the stated inequality ([59]) . 



E Proofs for Theorems 3 and 4

In this appendix, we prove various lemmas that are involved in the proofs of the lower bounds
given in Theorems 3 and 4.

E.1 Proof of Lemma 14

Let us introduce the shorthand notation $u = \Phi(f)$ and $v = \Phi(g)$. Under the model $\mathbb{P}_f$, for each
$i = 1, 2, \ldots, n$, the vector $y_i \in \mathbb{R}^m$ has a zero-mean Gaussian distribution with covariance matrix
$\Sigma_f := uu^T + \sigma_m^2 I$. Similarly, under the model $\mathbb{P}_g$, it is zero-mean Gaussian with covariance $\Sigma_g :=
vv^T + \sigma_m^2 I$. Since the data is i.i.d., the standard formula for the Kullback-Leibler divergence
between multivariate Gaussian distributions gives $\frac{2}{n} D(\mathbb{P}_f \,\|\, \mathbb{P}_g) = \log \frac{\det \Sigma_g}{\det \Sigma_f} + \operatorname{tr}(\Sigma_g^{-1} \Sigma_f) - m$.
Since $\|u\|_2 = \|v\|_2$ by construction, the matrices $\Sigma_f$ and $\Sigma_g$ have the same eigenvalues, and so
the first term vanishes. Using the matrix inversion formula, we have

$$\frac{2}{n} D(\mathbb{P}_f \,\|\, \mathbb{P}_g) + m = \big\langle\!\big\langle (\sigma_m^2 I + vv^T)^{-1},\ \sigma_m^2 I + uu^T \big\rangle\!\big\rangle = \Big\langle\!\Big\langle \sigma_m^{-2} I - \sigma_m^{-4} \frac{vv^T}{1 + \|v\|_2^2\, \sigma_m^{-2}},\ \sigma_m^2 I + uu^T \Big\rangle\!\Big\rangle$$

and some algebra, using the fact that $\|u\|_2 = \|v\|_2 = a$ implies $|\langle u, v \rangle| \le a^2$, yields the claim.
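The algebra can be checked numerically in dimension $m = 2$. Expanding the trace inner product above gives, per sample, $D = \frac{1}{2}\,\frac{a^4 - \langle u, v\rangle^2}{\sigma_m^2(\sigma_m^2 + a^2)}$ with $a = \|u\|_2 = \|v\|_2$; this closed form is our own simplification and is not quoted from the lemma. The sketch compares it with a direct evaluation of the Gaussian KL formula. All function names and the test vectors are illustrative.

```python
import math


def spiked_cov(u: list, sigma2: float) -> list:
    """2x2 covariance u u^T + sigma^2 I."""
    return [[u[i] * u[j] + (sigma2 if i == j else 0.0) for j in range(2)]
            for i in range(2)]


def kl_gaussian_2d(S_f: list, S_g: list) -> float:
    """KL( N(0, S_f) || N(0, S_g) ) for 2x2 covariance matrices."""
    det_f = S_f[0][0] * S_f[1][1] - S_f[0][1] * S_f[1][0]
    det_g = S_g[0][0] * S_g[1][1] - S_g[0][1] * S_g[1][0]
    inv_g = [[S_g[1][1] / det_g, -S_g[0][1] / det_g],
             [-S_g[1][0] / det_g, S_g[0][0] / det_g]]
    # tr(S_g^{-1} S_f)
    trace = sum(inv_g[i][j] * S_f[j][i] for i in range(2) for j in range(2))
    return 0.5 * (math.log(det_g / det_f) + trace - 2)


def closed_form(u: list, v: list, sigma2: float) -> float:
    """Per-sample KL from the Sherman-Morrison expansion, assuming ||u|| = ||v||."""
    a2 = sum(x * x for x in u)
    uv = sum(x * y for x, y in zip(u, v))
    return 0.5 * (a2**2 - uv**2) / (sigma2 * (sigma2 + a2))


u, v, sigma2 = [0.6, 0.8], [1.0, 0.0], 0.5   # both vectors have unit norm
kl_direct = kl_gaussian_2d(spiked_cov(u, sigma2), spiked_cov(v, sigma2))
kl_closed = closed_form(u, v, sigma2)
kl_same = kl_gaussian_2d(spiked_cov(u, sigma2), spiked_cov(u, sigma2))
```

The divergence for $n$ i.i.d. samples is $n$ times the per-sample value computed here.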
E.2 Proof of Lemma 15

As previously observed, any function $f \in \operatorname{Ra}(\Phi^*) \cap B_{\mathcal{H}}(1)$ can be represented by a vector in the
ellipse $\mathcal{E} := \{\theta \in \mathbb{R}^m \mid \sum_{j=1}^m \theta_j^2 / \hat{\mu}_j \le 1\}$ such that $\|f\|_\Phi = \|\theta\|_2$. The proofs of both parts (a)
and (b) exploit this representation.



(a) Note that the ellipse £ contains the ^T-ball of radius y /I m . It is known [22] that there 
exists a 1/2 packing of the i^-ball which has at least M = 4 m elements, all of which have unit 
norm. By rescaling this packing by we obtain a collection of M vectors {9 1 , . . . , 9 M } such 
that 

2 2 

110*11! = ^ and [|0*_0*[||> £°, for all* ^ j G [Ml. 
n An 
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The condition m < (con)^ implies that \\0 l \\\ < (coa 2 )m~ 2a < fl m , where the second inequality 
follows since by assumption (Al) we can take cr 2 . sufficiently small. Thus, these vectors are also 
contained within the ellipse £ , even after we rescale them further by 1/4, which establishes the 
claim. 

(b) This part makes use of the elementary inequality 

(0 («) 

klogk-k < 2_,logj < (fc + l)log(fc + l) -(fc + 1). (81) 

3=1 

We use known results on the entropy numbers of diagonal operators, in particular for the operator
mapping the $\ell_2$-ball to the ellipse $\mathcal{E}$. By assumption, we have $\hat{\mu}_j\, j^{2\alpha} \in [c_\ell, c_u]$ for all
$j = 1, 2, \ldots, m$. By Proposition 1.3.2 of [9] with $p = 2$, we have

$$\begin{aligned}
\log N_\Phi(\epsilon; \mathcal{E}) &\le \max_{k = 1, 2, \ldots, m} \Big\{ \frac{1}{2} \sum_{j=1}^k \log \hat{\mu}_j + k \log(1/\epsilon) \Big\} + \log 6 \\
&\le \max_{k = 1, 2, \ldots, m} \Big\{ -\alpha \sum_{j=1}^k \log j + k \log(1/\epsilon) \Big\} + \log(6 c_u) \\
&\le \max_{1 \le k \le m} f(k) + \log(6 c_u),
\end{aligned}$$

where $f(k) = \alpha(k - k \log k) + k \log(1/\epsilon)$. Since $f'(k) = -\alpha \log k + \log(1/\epsilon)$, the optimum is
achieved for $k^* = (1/\epsilon)^{1/\alpha}$, and has value $f(k^*) = \alpha (1/\epsilon)^{1/\alpha}$, which establishes the claim.
In the other direction, for all $k \in \{1, 2, \ldots, m\}$, we have

$$\log M_\Phi(\epsilon; \mathcal{E}) \ge \frac{1}{2} \sum_{j=1}^k \log \hat{\mu}_j + k \log(1/\epsilon) \ge -\alpha \sum_{j=1}^k \log j + k \log(1/\epsilon) + \log c_\ell.$$

Using the inequality (81)(ii), we obtain

$$\log M_\Phi(\epsilon; \mathcal{E}) \ge \alpha\big( (k+1) - (k+1) \log(k+1) \big) + k \log(1/\epsilon) + \log c_\ell.$$

The choice $k + 1 = (1/\epsilon)^{1/\alpha}$, which is valid under the given condition $(1/\epsilon)^{1/\alpha} \le m - 1$, yields
the claim.
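The optimization of $f(k) = \alpha(k - k\log k) + k\log(1/\epsilon)$ used in the upper bound can be verified numerically. The sketch below treats $k$ as a continuous variable, as the calculus argument does; the function names and the values $\alpha = 1.5$, $\epsilon = 0.01$ are illustrative assumptions.

```python
import math


def f(k: float, alpha: float, eps: float) -> float:
    """f(k) = alpha * (k - k log k) + k log(1/eps)."""
    return alpha * (k - k * math.log(k)) + k * math.log(1.0 / eps)


def f_prime(k: float, alpha: float, eps: float) -> float:
    """f'(k) = -alpha log k + log(1/eps)."""
    return -alpha * math.log(k) + math.log(1.0 / eps)


def k_star(alpha: float, eps: float) -> float:
    """Stationary point k* = (1/eps)^{1/alpha}."""
    return (1.0 / eps) ** (1.0 / alpha)


alpha, eps = 1.5, 0.01
ks = k_star(alpha, eps)
value_at_opt = f(ks, alpha, eps)            # should equal alpha * k*
claimed_value = alpha * (1.0 / eps) ** (1.0 / alpha)
slope_at_opt = f_prime(ks, alpha, eps)      # should vanish
```

Because $f''(k) = -\alpha/k < 0$, the function is concave on $k > 0$, so the stationary point is the global maximizer over the reals and an upper bound for the maximum over integers.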

E.3 Proof of Lemma 16

Any function $f$ in the set $\Psi_1^m$ has the form $f = \sum_{j=1}^m a_j \psi_j$ for a vector of coefficients $a \in \mathbb{R}^m$ such
that $\sum_{j=1}^m a_j^2 / \mu_j \le 1$. If $g = \sum_{j=1}^m b_j \psi_j$ is a second function, then we have $\|f - g\|_{L^2} = \|a - b\|_2$
by construction. Thus, the problem is equivalent to bounding the covering/packing numbers of
the $m$-dimensional ellipse specified by the eigenvalues $\{\mu_1, \ldots, \mu_m\}$. The claim thus follows from
the proof of Lemma 15(b).

F Suprema involving Gaussian products

Given a diagonal matrix $Q := \operatorname{diag}(\gamma_1, \ldots, \gamma_m) \in \mathbb{R}^{m \times m}$, this appendix provides bounds on
$\|Q^{1/2} \xi\|_2$, where $\xi \in \mathbb{R}^m$ is some random vector (a product of Gaussians in particular). The
following bound, which follows from Jensen's inequality, is useful:

$$\mathbb{E}\, \|Q^{1/2} \xi\|_2 \le \sqrt{\mathbb{E}\, \|Q^{1/2} \xi\|_2^2} = \sqrt{\operatorname{tr}(Q \Sigma_\xi)}, \qquad \text{where } \Sigma_\xi := \mathbb{E}\, \xi \xi^T. \qquad (82)$$

We prove a bound for the random vector $\xi := n^{-1} \sum_{i=1}^n \beta_i w_i \in \mathbb{R}^m$, where $\beta_i \sim N(0, 1)$,
independent of $w_i \sim N(0, I_m)$, and the pairs $(\beta_i, w_i)$ are i.i.d. for $i = 1, \ldots, n$.



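Lemma 17. For all $t > 0$, we have

$$\mathbb{P}\left[ \frac{\big\| Q^{1/2} \sum_{i=1}^n \beta_i w_i \big\|_2}{\|\beta\|_2} \ge \sqrt{\operatorname{tr}(Q)} + t \sqrt{|\!|\!| Q |\!|\!|_2} \right] \le \exp(-t^2/2), \qquad (83)$$

where $\beta = (\beta_1, \ldots, \beta_n)$.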

Proof. Define $\theta := \beta / \|\beta\|_2$, and observe that $\theta$ is uniformly distributed on the sphere $S^{n-1}$,
independent of $(w_i)$; we use $\sigma^{n-1}$ to denote this uniform distribution. The claim is a deviation
bound for $\|Q^{1/2} \sum_{i=1}^n \theta_i w_i\|_2$. With $\theta$ held fixed, we have $\tilde{w} := \sum_{i=1}^n \theta_i w_i \sim N(0, I_m)$. The map
$w \mapsto \|Q^{1/2} w\|_2$ is Lipschitz, from $\ell_2^m$ to $\mathbb{R}$, with Lipschitz constant bounded by $|\!|\!| Q^{1/2} |\!|\!|_2 = \sqrt{|\!|\!| Q |\!|\!|_2}$.
Hence, by concentration of the canonical Gaussian measure in $\mathbb{R}^m$, with $\theta$ held fixed, we have

$$\mathbb{P}\Big[ \|Q^{1/2} \tilde{w}\|_2 - \mathbb{E}\, \|Q^{1/2} \tilde{w}\|_2 \ge t \sqrt{|\!|\!| Q |\!|\!|_2} \Big] \le \exp(-t^2/2).$$

Since this bound holds for all realizations of $\theta$, the tower property implies that the same bound
holds unconditionally. Finally, from the bound (82), we have $\mathbb{E}\, \|Q^{1/2} \tilde{w}\|_2 \le \sqrt{\operatorname{tr}(Q)}$, from which
the claim follows. □

We now turn to bounding ||(5 1 / 2 (n~ 1 rjiWi — u)\\2, where u E M. m is some fixed vector. Let 
us patch u with U2, ■ ■ ■ , u m so that {u, u 2 , ■ ■ ■ , u m } is an orthonormal basis for t™ . Let us define 



the function ( : R n \{0} ->R&s ((x) := 



With this notation, we have the following: 



Lemma 18. Let u £ S m 1 anrf assume that U := (u U 2 ) = (u u 2 
nal. Let (wi,rji) 6 M m+1 6e z.z.d. Gaussian random vectors for i = 1, 



« m )Gl mxm is orthogo- 
, . , n with distribution 



N 



m U 
,T 1 



Then for all t > 0, 



|QV2 (n -iV^-n)|| 2 >(l + ^ 

i=l v 



</l| 2) (a /MQ) +/ 



71 



< 2exp(— n 



t At z 
16 



w/iere 77 = (171, . . . ,%). 



Proof. Since the pair [wi,rj) is jointly Gaussian, vectors {u>i} conditioned on 77 = (774) are i.i.d. 
Gaussian with E [wj 1 77^] = ijiU and cov(uij | r/j) = L m — uu T . Consequently, conditioned on 
77, the variable w v := n" 1 YliVi w i — u is Gaussian with mean u(n _1 ||7/||2 — 1) and covariance 



n 



uu 



Consequently, for w v := w^/in 1 ||r/|| 2 ), we have 



U T w v ~ N 





Im-1 
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where we have used U T u = ( ). Note that U T w r) is actually a degenerate Gaussian vector, so 
that we can write U T w v = w ')i f° r some w' ~ N(0, I m -i). 

Defining Q := U T QU, we have 

-C(»7)- 



\\Q 1/2 w v h = ||f/ T Q 1/2 C/C/ T ^|| 2 = ||Q 1/2 U T w v \\ 2 = Q in 



The map w' i— > \\Q 1 ^ 2 ] |L 1S Lipschitz, from 1 to R, with Lipschitz constant bounded by 

|||Q1/2||| = |Q1/2| = 



. By concentration of canonical Gaussian measure in 



im-i 



, we have 



>\}\QV 2 w r ,\\2-E\\Q 1 ' 2 wj2>ty/IM I V] < exp(-t 2 /2). 



Define the function k(t)) := ((Q, I m + (( 2 (rj) — l)uu T )) . Applying the inequality ([82]) with 
£ = an d Q instead of Q, we obtain 



E ||Q 1/2 ^|| 2 =E Q 1 



/2 



CM 



< 



{tr (Q 



C 2 (r?) 



L 



m— 1 



)} =v^M- 



Since Q ^ 0, we have 

k(t/) =trQ+ [C 2 ( V ) - l]u T Qu < trQ + ( 2 (r])u T Qu < tr(Q) + C 2 

Applying the inequality + & < \fa + Vb yields ^/ n(r]) < y/tr(Q) + \C(v)\\/\lQ\l- Consequently, 
we have shown the conditional bound, 



\\Q 1/2 (^EIU 



>V^Q) + ^t+\av)\)VlQl V < exp(-nt 2 /2). (84) 



n • L ||77||2 

By x 2 -tail bounds, we have P[jM^ — 1| > t] < exp(— n^p). Conditioned on the complement 
of this event, we have |C( ? ?)| < n^ll^h ; an d hence conditioning also on the complement of the 
event in bound (|84|) . we are guaranteed that 



i=l 



HQ 1 ^ (n -i £ _ n) || 2 < n-iy^yJ V ^(Q) + + — ^ T )v / fQl} 

< (1+ Nj )( ^(« 

with probability at least 1 — 2exp(— n- 



+ t 



, tAr 
16 



□ 



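G Bounding an operator norm of a Gaussian matrix

Given a sequence of positive numbers $\{\gamma_i\}_{i=1}^m$, consider the ellipse $\mathcal{E}_\gamma := \{v \in \mathbb{R}^m : \sum_{i=1}^m v_i^2 / \gamma_i \le 1\}$. In
this appendix, we derive an upper bound on the operator norm of a standard Gaussian random
matrix $W \in \mathbb{R}^{n \times m}$, viewed as an operator from $\mathbb{R}^m$ equipped with the norm induced by $\mathcal{E}_\gamma$, to
$\mathbb{R}^n$ equipped with the standard Euclidean norm $\|\cdot\|_2$.

Lemma 19. Let $W \in \mathbb{R}^{n \times m}$ be a standard Gaussian matrix. Then for all $t > 0$,

$$\mathbb{P}\Big[ \sup_{v \in \mathcal{E}_\gamma} \|Wv\|_2 \ge \sqrt{\|\gamma\|_1} + (\sqrt{n} + t) \sqrt{\|\gamma\|_\infty} \Big] \le \exp\Big( -\frac{t^2}{2} \Big). \qquad (85)$$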
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Proof. Let S* n_1 := {u G W 1 \ \\u\\2 = 1} denote the Euclidean unit sphere in M. n . Defin- 
ing S = {s = (u,v) | u E S"™ -1 , v G £ 7 }, consider the Gaussian process {Z S } S £$ where 
Z s = ((W, uv T )). By construction, we have sup^g^ ||Wu||2 = sup sgiS Z s . Our approach is to use 
Slepian's comparison for Gaussian processes [21] in order to bound E[sup s£iS Z s ] by E[sup<, gi 5 X s ], 
where X s is a second Gaussian process. Concretely, we define X s := y|py||^(u, g) + (v, h), where 
g and h are independent canonical Gaussian vectors in M n and R m , respectively. Let s = (u,v) 
and s' = (it', v') belong to S; by an elementary calculation, we have 

E [{Z s - Z s ,)] 2 = \\uv T - u'v' T f n s < hlloolln - u'\\\ + [|v - v'f 2 = E [(X s - X s *) 2 }, 

Consequently, we may apply Slepian's lemma to conclude 

E[sup Z s ] < E[sup X s ] = Vll7lUE[ sup (u,g)] +E sup (v,h) 

= Vh^(^\\9h)+^\\Q l/2 h\\ 2 
< VhWoo Vn + Vhh, 

where the final inequality follows by Jensen's inequality, and the relation tr(Q) = H7H1. 

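Finally, we note that $\|W\|_{\mathcal{E}_\gamma, 2} := \sup_{v \in \mathcal{E}_\gamma} \|Wv\|_2$ is a Lipschitz function of the Gaussian
matrix $W$, viewed as a vector in $\mathbb{R}^{mn}$, with Lipschitz constant $\sqrt{\|\gamma\|_\infty}$. Indeed, it is straightforward
to verify that $\sup_{v \in \mathcal{E}_\gamma} \|Wv\|_2 - \sup_{v \in \mathcal{E}_\gamma} \|W'v\|_2 \le \sqrt{\|\gamma\|_\infty}\, \|W - W'\|_{HS}$, so that the claim
follows by concentration of the canonical Gaussian measure in $\mathbb{R}^{mn}$ (e.g., see Ledoux [20]). □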
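As an illustration of Lemma 19, the supremum over the ellipse can be computed exactly as a spectral norm: substituting $v = \operatorname{diag}(\sqrt{\gamma})\,x$ with $\|x\|_2 \le 1$ gives $\sup_{v \in \mathcal{E}_\gamma} \|Wv\|_2 = \|W \operatorname{diag}(\sqrt{\gamma})\|_2$. The sketch below uses this to compare the empirical exceedance frequency with the bound (85). The function names, dimensions, weights, and number of trials are hypothetical and chosen only for illustration.

```python
import numpy as np

def ellipse_operator_norm(W, gamma):
    """sup of ||W v||_2 over the ellipse, via the spectral norm of W diag(sqrt(gamma))."""
    # Broadcasting scales column j of W by sqrt(gamma_j).
    return np.linalg.norm(W * np.sqrt(gamma), 2)

def lemma19_threshold(n, gamma, t):
    """Right-hand threshold inside the probability in (85)."""
    return np.sqrt(gamma.sum()) + (np.sqrt(n) + t) * np.sqrt(gamma.max())

rng = np.random.default_rng(1)
n, m, t, trials = 30, 20, 1.0, 2000
gamma = 1.0 / np.arange(1, m + 1)  # hypothetical weights

norms = np.array([
    ellipse_operator_norm(rng.standard_normal((n, m)), gamma)
    for _ in range(trials)
])
exceed_freq = np.mean(norms >= lemma19_threshold(n, gamma, t))

print(exceed_freq, np.exp(-t ** 2 / 2))  # empirical frequency vs. the bound in (85)
```

The empirical frequency is typically far below $\exp(-t^2/2)$, which is expected because the lemma is an upper bound rather than a sharp tail estimate.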

H A uniform law 

In this appendix, we state and prove a technical lemma used in parts of our analysis. Consider
some subset $\mathcal{D}$ of $\mathbb{R}^n$. Let $\nu$ be an index taking values in some index set $\mathcal{I}$. We assume that
$\nu$ is indexing a collection of random (noise) matrices $A_\nu$. Suppose that there is a collection of
nonnegative nondecreasing (possibly random) functions $\mathcal{G}_\nu : [0, \infty) \to [0, \infty)$ such that for all
$t > 0$ and $\nu \in \mathcal{I}$,

$$\mathbb{P}\Big\{ \sup_{v \in \mathcal{D},\ \|v\|_2 \le t} G(v; A_\nu) > \mathcal{G}_\nu(t) \Big\} \le c_1 \exp[-c_2 \theta_\nu (t \wedge t^2)], \tag{86}$$

where $\theta_\nu$, $\nu \in \mathcal{I}$, are some positive numbers and $G$ is some function.

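Lemma 20. Under (86) and for any collection $\{t_\nu\}_{\nu \in \mathcal{I}}$ such that $\inf_{\nu \in \mathcal{I}} \theta_\nu (t_\nu \wedge t_\nu^2) > 0$, we have
for any $\nu \in \mathcal{I}$,

$$\sup_{v \in \mathcal{D}} \big[ G(v; A_\nu) - \mathcal{G}_\nu(2\|v\|_2) \big] \le \mathcal{G}_\nu(2 t_\nu), \tag{87}$$

with probability at least $1 - c_1 \exp[-c_2 \theta_\nu (t_\nu \wedge t_\nu^2)]$.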

Proof. The proof is based on a peeling argument (e.g., [31]). Define $c := \inf_{\nu \in \mathcal{I}} \theta_\nu (t_\nu \wedge t_\nu^2)$ and fix
some $\nu \in \mathcal{I}$. First, note that as $v$ varies over $\mathcal{D}$, the function $v \mapsto \|v\|_2 \vee t_\nu$ takes values in $[t_\nu, \infty)$.

Define, for $s \in \{1, 2, \ldots\}$,

$$\mathcal{D}_s := \{ v \in \mathcal{D} : 2^{s-1} t_\nu \le (\|v\|_2 \vee t_\nu) < 2^s t_\nu \}.$$

We have $\mathcal{D} = \bigcup_{s=1}^{\infty} \mathcal{D}_s$. If there exists $v \in \mathcal{D}$ such that

$$G(v; A_\nu) > \mathcal{G}_\nu(2\|v\|_2 \vee 2 t_\nu), \tag{88}$$

then $v$ belongs to $\mathcal{D}_s$ for some $s \in \{1, 2, \ldots\}$, and (88) holds for this $v$. Using the union bound,



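$$\mathbb{P}\big( \exists\, v \in \mathcal{D} : G(v; A_\nu) > \mathcal{G}_\nu(2\|v\|_2 \vee 2 t_\nu) \big) \le \sum_{s=1}^{\infty} \mathbb{P}\big( \exists\, v \in \mathcal{D}_s : G(v; A_\nu) > \mathcal{G}_\nu(2\|v\|_2 \vee 2 t_\nu) \big).$$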

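For $v \in \mathcal{D}_s$, (88) implies

$$G(v; A_\nu) > \mathcal{G}_\nu(2\|v\|_2 \vee 2 t_\nu) \ge \mathcal{G}_\nu(2 \cdot 2^{s-1} t_\nu) = \mathcal{G}_\nu(2^s t_\nu),$$
where we have used that $\mathcal{G}_\nu$ is nondecreasing. Since $\mathcal{D}_s \subset \{v : \|v\|_2 \le 2^s t_\nu\}$, we conclude that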



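$$\mathbb{P}\big( \exists\, v \in \mathcal{D} : G(v; A_\nu) > \mathcal{G}_\nu(2\|v\|_2 \vee 2 t_\nu) \big) \le \sum_{s=1}^{\infty} \mathbb{P}\Big( \sup_{v \in \mathcal{D},\ \|v\|_2 \le 2^s t_\nu} G(v; A_\nu) > \mathcal{G}_\nu(2^s t_\nu) \Big) \le \sum_{s=1}^{\infty} c_1 \exp[-c_2 \theta_\nu 2^s (t_\nu \wedge t_\nu^2)]$$

from assumption (86). The last summation is bounded above by

$$c_1 \sum_{k=1}^{\infty} \exp[-c_2 \theta_\nu (t_\nu \wedge t_\nu^2)\, k] = \frac{c_1\, e^{-c_2 \theta_\nu (t_\nu \wedge t_\nu^2)}}{1 - e^{-c_2 \theta_\nu (t_\nu \wedge t_\nu^2)}}.$$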

We get the assertion by noting that for $a, b > 0$, $\mathcal{G}_\nu(a \vee b) = \mathcal{G}_\nu(a) \vee \mathcal{G}_\nu(b) \le \mathcal{G}_\nu(a) + \mathcal{G}_\nu(b)$
because $\mathcal{G}_\nu$ is assumed to be nondecreasing and nonnegative. □
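The two mechanical ingredients of the peeling argument can be sketched in a few lines: assigning a point to its dyadic shell, and comparing the peeled sum $\sum_{s \ge 1} e^{-a 2^s}$ with the geometric series $\sum_{k \ge 1} e^{-a k} = e^{-a}/(1 - e^{-a})$. Here $a$ is a placeholder for the exponent constant in (86) evaluated at $t_\nu$. The function names and test values are hypothetical and not taken from the paper.

```python
import math

def shell_index(norm_v, t):
    """Index s >= 1 such that 2^(s-1) * t <= max(norm_v, t) < 2^s * t."""
    ratio = max(norm_v, t) / t  # always >= 1
    return math.floor(math.log2(ratio)) + 1

def peeled_sum(a, terms=60):
    """Truncated sum of exp(-a * 2^s) over s = 1, ..., terms."""
    return sum(math.exp(-a * 2 ** s) for s in range(1, terms + 1))

def geometric_bound(a):
    """Closed form of the sum of exp(-a * k) over k = 1, 2, ..."""
    return math.exp(-a) / (1.0 - math.exp(-a))

for a in (0.1, 1.0, 5.0):
    # The peeled sum only visits k = 2, 4, 8, ..., so it cannot exceed the full series.
    print(a, peeled_sum(a), geometric_bound(a))
```

Because the exponents $2^s$ form a subset of the positive integers, the peeled sum is dominated term by term by the geometric series, which is the comparison the proof relies on.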

I Some useful matrix-theoretic inequalities 

Fan's inequality states that for symmetric matrices $A$ and $B$ and eigenvalues ordered as $\lambda_1(A) \ge
\cdots \ge \lambda_m(A)$ (and similarly for $B$), we have $\operatorname{tr}(AB) \le \sum_{i=1}^m \lambda_i(A) \lambda_i(B)$. As a consequence, for
a symmetric matrix $B$ and symmetric matrix $A \succeq 0$, we have

$$\lambda_{\min}(B) \operatorname{tr}(A) \le \operatorname{tr}(AB) \le \lambda_{\max}(B) \operatorname{tr}(A). \tag{89}$$

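It follows that for a symmetric matrix $D \succeq 0$ and $R \in \mathbb{R}^{r \times r}$, we have

$$\lambda_{\min}(R^T R) \operatorname{tr}(D) \le \operatorname{tr}(D R R^T) = \operatorname{tr}(R^T D R) \le \lambda_{\max}(R^T R) \operatorname{tr}(D), \tag{90}$$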

where we have used the fact that $R R^T$ and $R^T R$ have the same eigenvalues.
For $B \succeq 0$, we have

$$\lambda_{\min}(B)\, \lambda_j(R^T R) \le \lambda_j(R^T B R) \le \lambda_{\max}(B)\, \lambda_j(R^T R), \tag{91}$$

which can be established using the classical min-max formulation of the $j$th eigenvalue, namely

$$\lambda_j(C) = \max_{\mathcal{M} :\, \dim(\mathcal{M}) = j}\ \min_{z \in \mathcal{M} \cap S^{r-1}} z^T C z, \tag{92}$$

where the maximum is taken over all $j$-dimensional subspaces of $\mathbb{R}^r$. Finally, the inequality (91)
implies that

$$\|R^T B R\|_{HS} \le \|B\|_2\, \|R^T R\|_{HS}. \tag{93}$$
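The inequalities above are easy to sanity-check on random instances. The following sketch draws random matrices and evaluates Fan's inequality together with (89), (90), (91), and (93) up to a small floating-point tolerance. The dimension `r`, the seed, the tolerance, and the helper names are hypothetical choices for illustration only; a passing check on one random draw is not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)
r = 6        # hypothetical dimension
tol = 1e-8   # slack for floating-point error

def random_psd(k):
    """Random positive semidefinite k x k matrix."""
    X = rng.standard_normal((k, k))
    return X @ X.T

def sym_eigs(M):
    """Eigenvalues of a (numerically symmetrized) matrix, in ascending order."""
    return np.linalg.eigvalsh((M + M.T) / 2)

A, B_psd, D = random_psd(r), random_psd(r), random_psd(r)
S = rng.standard_normal((r, r))
B = (S + S.T) / 2                      # symmetric, possibly indefinite
R = rng.standard_normal((r, r))

lam_B, lam_Bp = sym_eigs(B), sym_eigs(B_psd)
lam_RtR, lam_RtBR = sym_eigs(R.T @ R), sym_eigs(R.T @ B_psd @ R)

checks = {
    # Both spectra are ascending, so the dot product pairs equally ranked eigenvalues.
    "fan": np.trace(A @ B) <= sym_eigs(A) @ lam_B + tol,
    "eq89": lam_B[0] * np.trace(A) - tol <= np.trace(A @ B) <= lam_B[-1] * np.trace(A) + tol,
    "eq90": lam_RtR[0] * np.trace(D) - tol <= np.trace(R.T @ D @ R) <= lam_RtR[-1] * np.trace(D) + tol,
    "eq91": bool(np.all(lam_Bp[0] * lam_RtR - tol <= lam_RtBR)
                 and np.all(lam_RtBR <= lam_Bp[-1] * lam_RtR + tol)),
    "eq93": np.linalg.norm(R.T @ B_psd @ R, "fro")
            <= np.linalg.norm(B_psd, 2) * np.linalg.norm(R.T @ R, "fro") + tol,
}
print(checks)
```

Note that (91) and (93) are checked with a positive semidefinite $B$, matching the hypothesis $B \succeq 0$, whereas Fan's inequality and (89) are checked with an indefinite symmetric $B$.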